Change MultiIndex repr ? #13480
Comments
It would be nice to have the possibility to change the representation through |
Any thoughts on the desired behaviour @jorisvandenbossche? I think that tuples are a good choice. The tuple output is also close to the workaround many people use:
In [15]: x = numpy.arange(10000)
In [16]: pandas.MultiIndex.from_arrays([x, x]).values
Out[16]:
array([(0, 0), (1, 1), (2, 2), ..., (9997, 9997), (9998, 9998),
(9999, 9999)], dtype=object) |
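For readers who want to try the workaround mentioned above, here is a minimal sketch on a much smaller index. The level names `first` and `second` are made up for illustration; `MultiIndex.from_arrays` and `.values` are the calls used in the comment itself.

```python
import numpy as np
import pandas as pd

x = np.arange(5)
mi = pd.MultiIndex.from_arrays([x, x], names=["first", "second"])

# .values materialises every row as a tuple inside an object-dtype ndarray.
rows = mi.values
print(rows[:3])

# Iterating over a MultiIndex also yields one tuple per row.
first_row = next(iter(mi))
print(first_row)
```

Note that `.values` builds the full array of tuples, so on a very large index it can be costly; that is part of why a truncated tuple repr is attractive.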
I personally also think that tuples would be good representation. The main 'problem' with it is that something like |
I've made a suggestion that works locally but would appreciate input as to the exact format. I too like the tuple format the best, but the tuples should IMO be vertically stacked, so the user at a glance can see both the individual level (vertically) and each row (horizontally). Various attributes should also be shown at the bottom, to mirror CategoricalIndex. An example with a reasonably complex MultiIndex:
I like that levels has type information. The individual levels should concatenate their value on the first line, but otherwise I like this. Thoughts? |
Any comment on the proposed format? |
What do you do if not all combinations are present? Probably still need to show |
Not IMO. I see labels as an implementation detail, similar to CategoricalIndex.codes. There is even an approved issue (#13443) to change the name of So, IMO we should absolutely not show labels in the repr, it just confuses. |
Fair enough. Plus, there'll be the repr of the tuples already. |
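To make the "labels are an implementation detail" point concrete, here is a small sketch of how a MultiIndex stores its rows when not all combinations are present. It assumes pandas 0.24 or later, where the integer positions are exposed as `MultiIndex.codes`; the fallback to `labels` is only for older versions. The index contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# Only three of the four possible (letter, number) combinations are present.
mi = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["letter", "number"]
)

# The levels hold the unique values of each level.
level_values = [list(level) for level in mi.levels]

# The integer codes (called "labels" in this thread) point into the levels.
raw_codes = mi.codes if hasattr(mi, "codes") else mi.labels
codes = [np.asarray(c).tolist() for c in raw_codes]

# Each row is recovered by looking up one code per level.
rebuilt = [
    tuple(values[i] for values, i in zip(level_values, row))
    for row in zip(*codes)
]
print(level_values)
print(codes)
print(rebuilt)
```

The tuples in `rebuilt` are exactly what a tuple-based repr would show, which is why the codes add little for the reader.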
@topper-123 a note on impl we already have all of the machinery to do this see what we are doing for all other Indexes and follow the pattern / subclassing this is a hard problem because of display wrapping and indentation - but it is already solved |
Yes, I mostly did that, though I had a problem with wrapping each value in a new line, rather than wrapping several values, as other indexes do. I'll look into whether I can get that part unified with the rest as well. |
Current proposal looks like this:
>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n) + (['bcd'] * n), categories=['a', 'bcd'], ordered=True)
>>> dti = pd.date_range('2000-01-01', freq='s', periods=2000000)
>>> mi = pd.MultiIndex.from_arrays([ci, ci.codes+9, dti, dti, dti], names=['a', 'b', 'x', 'x2', 'x3'])
>>> mi
MultiIndex([( 'a', 9, '2000-01-01 00:00:00', '2000-01-01 00:00:00', ...),
( 'a', 9, '2000-01-01 00:00:01', '2000-01-01 00:00:01', ...),
( 'a', 9, '2000-01-01 00:00:02', '2000-01-01 00:00:02', ...),
( 'a', 9, '2000-01-01 00:00:03', '2000-01-01 00:00:03', ...),
( 'a', 9, '2000-01-01 00:00:04', '2000-01-01 00:00:04', ...),
( 'a', 9, '2000-01-01 00:00:05', '2000-01-01 00:00:05', ...),
( 'a', 9, '2000-01-01 00:00:06', '2000-01-01 00:00:06', ...),
( 'a', 9, '2000-01-01 00:00:07', '2000-01-01 00:00:07', ...),
( 'a', 9, '2000-01-01 00:00:08', '2000-01-01 00:00:08', ...),
( 'a', 9, '2000-01-01 00:00:09', '2000-01-01 00:00:09', ...),
...
('bcd', 10, '2000-01-24 03:33:10', '2000-01-24 03:33:10', ...),
('bcd', 10, '2000-01-24 03:33:11', '2000-01-24 03:33:11', ...),
('bcd', 10, '2000-01-24 03:33:12', '2000-01-24 03:33:12', ...),
('bcd', 10, '2000-01-24 03:33:13', '2000-01-24 03:33:13', ...),
('bcd', 10, '2000-01-24 03:33:14', '2000-01-24 03:33:14', ...),
('bcd', 10, '2000-01-24 03:33:15', '2000-01-24 03:33:15', ...),
('bcd', 10, '2000-01-24 03:33:16', '2000-01-24 03:33:16', ...),
('bcd', 10, '2000-01-24 03:33:17', '2000-01-24 03:33:17', ...),
('bcd', 10, '2000-01-24 03:33:18', '2000-01-24 03:33:18', ...),
('bcd', 10, '2000-01-24 03:33:19', '2000-01-24 03:33:19', ...)],
dtype='object', names=['a', 'b', 'x', 'x2', 'x3'], length=2000000)
So we now got:
Comments on the look of this? |
@pandas-dev/pandas-core I would like to draw some attention to this. TL;DR: There is a PR implementing a new MultiIndex repr (#22511), which is kind of stuck because I want more feedback on the proposed repr. The MultiIndex repr can use some improvement, see top post (#13480 (comment)) for some reasons. @topper-123 made a concrete proposal 3 months ago (see the post above this one for details), and has a PR implementing it (#22511):
I have some remarks on that proposal (summarized here from #22511 (review)):
The MultiIndex is a more complex object than a normal Index (multiple levels -> multiple dtypes and names, typically repeated values in a level), so therefore we could consider also a more advanced repr for it. Just as an example of what another repr for the MultiIndex could be, here a small mock-up (inspired on
The formatting is certainly not yet perfect, but it's to give an idea. I personally think the above is more informative than the other proposed repr (it includes an easier overview of the different levels, it includes information about the dtypes of the levels, ..). My main issue is that I would just like to have at least some discussion on this. We actually never discussed the initial proposal of @topper-123, and also never really discussed my other idea. And to be clear: I am not saying that my idea is necessarily better, I mainly want us to have at least a bit of discussion about it, as we should not change a repr lightly (and if we change it now, we should not change it again for some time). I certainly find what is in the PR a clear improvement over master (it is also based on what I initially proposed myself when originally opening this issue). |
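As a rough illustration of the "overview of levels" idea, the following sketch prints one line per level with its name, dtype and number of level values. `summarize_levels` is a hypothetical helper written for this example, not a pandas API, and the output layout is only loosely modelled on the mock-up discussed in this thread.

```python
import pandas as pd


def summarize_levels(mi):
    """Hypothetical helper: one line per level (name, dtype, level size)."""
    lines = ["Levels ({}):".format(mi.nlevels)]
    for name, level in zip(mi.names, mi.levels):
        # len(level) counts the stored level values, which may include
        # values that are no longer used by any row.
        lines.append("  * {}: {} ({})".format(name, level.dtype, len(level)))
    lines.append("Length: {}".format(len(mi)))
    return "\n".join(lines)


mi = pd.MultiIndex.from_product(
    [["a", "bcd"], [9, 10], pd.date_range("2000-01-01", periods=3, freq="D")],
    names=["a", "b", "x"],
)
summary = summarize_levels(mi)
print(summary)
```

Such a summary stays the same size no matter how many rows the index has, which is the property that makes it useful when the tuple preview truncates trailing levels.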
I didn't realize that @topper-123's proposed repr would truncate past certain levels, but that is of course unavoidable. Does #22511 include a system like DataFrame, where large reprs are automatically truncated?
IMO, the most important part of a MultiIndex repr is seeing the name and type of each level, and a few values if possible. If you need to see more values, then it can be used as the index for a Series, and you'll get the (maybe sparsified) version of what's above.
…On Thu, Dec 6, 2018 at 4:28 AM Joris Van den Bossche < ***@***.***> wrote:
@pandas-dev/pandas-core
<https://github.com/orgs/pandas-dev/teams/pandas-core> I would like to
draw some attention to this.
TL;DR: There is a PR implementing a new MultiIndex repr (#22511), which is kind of
stuck because I want more feedback on the proposed repr.
------------------------------
The MultiIndex repr can use some improvement, see top post (#13480
(comment)
<#13480 (comment)>) for
some reasons.
@topper-123 <https://github.com/topper-123> made a concrete proposal 3
months ago (see the post above this one for details), and has a PR
implementing it (#22511 <#22511>
):
MultiIndex([( 'a', 9, '2000-01-01 00:00:00', '2000-01-01 00:00:00', ...),
( 'a', 9, '2000-01-01 00:00:01', '2000-01-01 00:00:01', ...),
( 'a', 9, '2000-01-01 00:00:02', '2000-01-01 00:00:02', ...),
( 'a', 9, '2000-01-01 00:00:03', '2000-01-01 00:00:03', ...),
( 'a', 9, '2000-01-01 00:00:04', '2000-01-01 00:00:04', ...),
( 'a', 9, '2000-01-01 00:00:05', '2000-01-01 00:00:05', ...),
( 'a', 9, '2000-01-01 00:00:06', '2000-01-01 00:00:06', ...),
( 'a', 9, '2000-01-01 00:00:07', '2000-01-01 00:00:07', ...),
( 'a', 9, '2000-01-01 00:00:08', '2000-01-01 00:00:08', ...),
( 'a', 9, '2000-01-01 00:00:09', '2000-01-01 00:00:09', ...),
...
('bcd', 10, '2000-01-24 03:33:10', '2000-01-24 03:33:10', ...),
('bcd', 10, '2000-01-24 03:33:11', '2000-01-24 03:33:11', ...),
('bcd', 10, '2000-01-24 03:33:12', '2000-01-24 03:33:12', ...),
('bcd', 10, '2000-01-24 03:33:13', '2000-01-24 03:33:13', ...),
('bcd', 10, '2000-01-24 03:33:14', '2000-01-24 03:33:14', ...),
('bcd', 10, '2000-01-24 03:33:15', '2000-01-24 03:33:15', ...),
('bcd', 10, '2000-01-24 03:33:16', '2000-01-24 03:33:16', ...),
('bcd', 10, '2000-01-24 03:33:17', '2000-01-24 03:33:17', ...),
('bcd', 10, '2000-01-24 03:33:18', '2000-01-24 03:33:18', ...),
('bcd', 10, '2000-01-24 03:33:19', '2000-01-24 03:33:19', ...)],
dtype='object', names=['a', 'b', 'x', 'x2', 'x3'], length=2000000)
I have some remarks on that proposal (summarized here from #22511 (review)
<#22511 (review)>
):
-
It looks like valid code but it is actually not:
- The main reason is because the default constructor does not accept a
list of tuples (MI.from_tuples does that), although there is an
issue to discuss to change this: #23887
<#23887>. So depending
on that, this might be a moot argument.
- Even regardless of the above, due to truncation symbols and the
length indication, it is often still not valid code. But this is of course
exactly the same situation as with other Index reprs.
-
When you have more levels, the tuples in the repr get truncated (as in
the example above). This has the consequence that for such cases, you don't
see anything in the default repr about this level, except the name (but for
example not even the type). Of course, this will typically not be an issue
with MIs with 2-3 levels.
For a DataFrame the situation can be similar (columns not visible in
the default repr), but you have eg the info() method to get an
overview of the columns and its types. So this could be an argument to add
a similar method to MI, but it could maybe also be included in the main
repr? (see below)
The MultiIndex is a more complex object than a normal Index (multiple
levels -> multiple dtypes and names, typically repeated values in a level),
so therefore we could consider also a more advanced repr for it.
Just as an example of what another repr for the MultiIndex could be, here
a small mock-up (inspired on info and on the repr of xarrays objects
where the different dimensions are listed, initially posted here: #22511
(comment)
<#22511 (comment)>):
<pandas.MultiIndex>
Levels (5):
* a: category (2) ['a', 'abc']
* b: int64 (2) [9, 10]
* dti_1: datetime64[ns] (2000) ['2000-01-01 00:00:00', ... ]
* dti_2: datetime64[ns] (2000) ['2000-01-01 00:00:00', ... ]
* dti_3: datetime64[ns] (2000) ['2000-01-01 00:00:00', ... ]
[('a', 9, '2000-01-01 00:00:00', '2000-01-01 00:00:00', '2000-01-01 00:00:00'),
('a', 9, '2000-01-01 00:00:01', '2000-01-01 00:00:01', '2000-01-01 00:00:01'),
('a', 9, '2000-01-01 00:00:02', '2000-01-01 00:00:02', '2000-01-01 00:00:02'),
...,
('abc', 10, '2000-01-01 00:33:17', '2000-01-01 00:33:17', '2000-01-01 00:33:17'),
('abc', 10, '2000-01-01 00:33:18', '2000-01-01 00:33:18', '2000-01-01 00:33:18'),
('abc', 10, '2000-01-01 00:33:19', '2000-01-01 00:33:19', '2000-01-01 00:33:19')],
Length: 2000
The formatting is certainly not yet perfect, but it's to give an idea. I
personally think the above is more *informative* than the other proposed
repr (it includes an easier overview of the different levels, it includes
information about the dtypes of the levels, ..).
My main issue is that I would just like to have at least *some*
discussion on this. We actually never discussed the initial proposal of
@topper-123 <https://github.com/topper-123>, and also never really
discussed my other idea.
*What do we think is the best / most informative repr for the MultiIndex?*
And to be clear: I am not saying that my idea is necessarily better, I
mainly want us to have at least a bit of discussion about it, as we should
not change a repr lightly (and if we change it now, we should not change it
again for some time). I certainly find what is in the PR a clear
improvement over master (it is also based on what I initially proposed
myself when originally opening this issue).
|
@jorisvandenbossche I actually like the bottom repr (xarray like), but I agree that is maybe more food for a I would go with @topper-123 repr, with 1 change. Let's drop I also don't mind/care that @topper-123 repr is not actually executable, this principle is just true for short Index reprs, and is not true for EA at all. So I don't think this is a consideration. |
@TomAugspurger What do you mean exactly?
The current repr in #22511 only shows the names (and the values if fitting in the width of the console)
If we go with #22511, I think adding an
I think if it is combined with some values as tuples, I personally don't think it would be too much of a problem. We could maybe also put the preview of values as tuples on top, and the overview of levels below it. |
Oh, I was thinking in the past, when we had
I think I agree with this... Personally, when I need to see the values of a MultiIndex I stick it in a Series |
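The "stick it in a Series" approach can be sketched as follows. The index contents are invented for illustration; `MultiIndex.to_frame` is shown as a related way to look at the values, with one column per level.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["a", "bcd"], [9, 10]], names=["a", "b"])

# Using the MultiIndex as a Series index gives the familiar sparsified display,
# where repeated outer-level values are printed only once.
s = pd.Series(range(len(mi)), index=mi)
print(s)

# to_frame turns every level into a column, so dtypes can be inspected per level.
frame = mi.to_frame(index=False)
print(frame.dtypes)
```

Both views reuse display machinery that already handles truncation and wrapping for large objects.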
I'm in agreement with @jreback. I also like having a
That would mean adding a EDIT: other indexes use Index.dtype.name and not just index.dtype in their repr, changed it to be similar. |
maybe could have a way to display the names & dtypes in a nicer way here as well (in a combined way) |
hmm, some levels may not have a name. What did you have in mind? |
hmm yeah, maybe just zip them into a list of tuples? |
I think that would confuse. If we simply use dtypes and names, people can look up individually using MultiIndex.dtypes and MultiIndex.names. I think that would be nice and simple. BTW, I did the repr above slightly different, if you didn’t notice. |
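To show what the two options look like in practice, here is a minimal sketch with one unnamed level. It reads the dtypes from `mi.levels`, which should work on both old and new pandas versions; the index contents are invented for illustration.

```python
import pandas as pd

# The second level deliberately has no name.
mi = pd.MultiIndex.from_arrays(
    [["a", "a", "bcd"], [9, 9, 10]], names=["a", None]
)

# Separate lookups: names and per-level dtypes.
names = list(mi.names)
level_dtypes = [level.dtype for level in mi.levels]

# The "zipped" alternative: a list of (name, dtype) tuples.
pairs = list(zip(names, level_dtypes))
print(names)
print(level_dtypes)
print(pairs)
```

With an unnamed level the zipped form contains `None` as a name, which illustrates the concern raised above about levels without names.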
I've been thinking a bit more about dtypes in the repr. Having dtypes in the repr would make it not possible to recreate multiindexes from their repr (after implementing #23887 and possibly setting I agree with @jorisvandenbossche that the current proposal is real-looking and therefore should be creatable from the repr, if possible. This is also a good/common convention in Python. So I'm leaning towards not having dtypes in the repr after all would be the best trade-off. Thoughts? |
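The trade-off can be checked with a short sketch: rebuilding an index from its tuples via `MultiIndex.from_tuples` round-trips plain levels, but the tuples alone do not carry a categorical dtype. The index contents are invented for illustration.

```python
import pandas as pd

# Plain levels: the tuples are enough to rebuild an equal index.
mi = pd.MultiIndex.from_arrays(
    [["a", "a", "bcd"], [9, 10, 10]], names=["a", "b"]
)
rebuilt = pd.MultiIndex.from_tuples(list(mi), names=list(mi.names))
print(rebuilt.equals(mi))

# Categorical level: the values survive, but the level dtype does not.
cat_mi = pd.MultiIndex.from_arrays([pd.Categorical(["a", "bcd"]), [9, 10]])
rebuilt_cat = pd.MultiIndex.from_tuples(list(cat_mi))
print(cat_mi.levels[0].dtype, rebuilt_cat.levels[0].dtype)
```

So a repr made only of tuples and names can recreate the values, while dtype information would need to be supplied separately.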
(forgot to post, comment of a day ago; not yet an answer on the last comment of @topper-123 )
I have the feeling that adding this additional information for a MultiIndex somewhat bumps into the limits of what the current constructor-like Index repr can handle. Why do we necessarily want to keep to the python-code-like repr if we are adding a lot of "keywords" that are not actually keywords? (although the same is already true for "length" for normal Index)
Well, that's basically what I tried with my proposal (not saying it was the best attempt, but that was at least the idea of it.
although I don't know how to make it clear that the numbers between brackets are the number of unique values in the level. I am not sure you can have such a combination with python-like code. @jreback What would you propose how it looks like? (a dict?) |
I am -1 on the xarray like constructor. I think it’s completely misleading as this is a row oriented object. I would be +1 on the proposed repr as is (w/o the dtype=‘object’) and +0 on adding a .info() I completely discount the point of non-executable nature of this code as this is irrelevant; any index is non repr after truncation; furthermore there is a deliberate effort to make non repr Arrays as well. |
Personally, I don't have a strong preference. I find the "info" style one a
bit more informative at a glance,
but if that's available as a .info method then that's fine too.
We'll need to see what Joris thinks.
And just to be clear, thanks for your work on this @topper-123. Either way,
this is a big improvement over the current repr.
…On Fri, Dec 14, 2018 at 6:16 AM Jeff Reback ***@***.***> wrote:
I am -1 on the xarray like constructor. I think it’s completely misleading
as this is a row oriented object. I would be +1 on the proposed repr as is
(w/o the dtype=‘object’) and +0 on adding a .info()
I completely discount the point of non-executable nature of this code as
this is irrelevant; any index is non repr after truncation; furthermore
there is a deliberate effort to make non repr Arrays as well.
|
To be clear: I am not saying that the repr should be executable (as my proposed actually isn't). I am mainly trying to argue that we shouldn't jump through hoops and end up with a less informative repr only to keep it "python code-like" (eg adding the multiple dtypes in a keyword), while it is not executable anyway.
Yes, I understand that. But that is not the only argument. The main reason for having the overview of the levels is to be more informative + counteract that the repr in the PR can hide levels.
@jreback What do you mean here?
Above you said "actually like the bottom repr (xarray like)", or would you still like it as info object?
Big +1 ! |
I have already stated that I am -1 on the xarray like repr as the default as it's amazingly confusing for a row-oriented object. We have spent an enormous amount of time on this. @jorisvandenbossche Either propose a new PR, or accept the existing. |
Joris did propose an alternative... Regardless, this seems worth taking the
time to get it right.
…On Sun, Dec 16, 2018 at 1:55 PM Jeff Reback ***@***.***> wrote:
Above you said "actually like the bottom repr (xarray like)", or would you
still like it as info object?
.info() could be added but that is not the subject of this issue / PR.
I have already stated that I am -1 on the xarray like repr as the default
as it's amazingly confusing for a row-oriented object.
We have spent an enormous amount of time on this. @jorisvandenbossche
<https://github.com/jorisvandenbossche> Either propose a new PR, or
accept the existing.
|
And I have numerous times been -1 on that. |
From #13443 (comment)
The current MultiIndex representation looks like this:
So this shows the underlying labels and levels. Personally, I don't find this a very good repr, because:
level and labels are actually (internal) concepts of MultiIndex that a lot of users do not have to think about.
So the question is: is there a better alternative?
We could show tuples:
Or the individual levels:
or ..
Historical note: in older versions, there was a difference between the repr and str of a MultiIndex:
but this seems to have disappeared (on purpose or not).