Skip to content

sort_index fails on MutiIndex'ed DataFrame resulting from groupby.apply #15687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
8one6 opened this issue Mar 14, 2017 · 6 comments
Closed

sort_index fails on MutiIndex'ed DataFrame resulting from groupby.apply #15687

8one6 opened this issue Mar 14, 2017 · 6 comments
Labels
Groupby MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@8one6
Copy link

8one6 commented Mar 14, 2017

I feel like this is part of a suite of bugs that come from a failure to notice when a MultiIndex that was once lexsorted loses its lexsortedness. I submitted one example of this a couple years ago (see #8017), but this related bug persists.

On Pandas 0.19.0:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(
    np.random.randn(8, 2),
    index=pd.MultiIndex.from_product([['a', 'b'], ['big', 'small'], ['red', 'blue']], names=['letter', 'size', 'color']),
    columns=['near', 'far']
)
df = df.sort_index()

def my_func(group):
    group.index = ['newz', 'newa']
    return group

res = df.groupby(level=['letter', 'size']).apply(my_func).sort_index()

print res
###OUTPUT###
                       near       far
letter size                          
a      big   newz  0.978738  2.240893
             newa  1.764052  0.400157
       small newz  0.950088 -0.151357
             newa  1.867558 -0.977278
b      big   newz  0.144044  1.454274
             newa -0.103219  0.410599
       small newz  0.443863  0.333674
             newa  0.761038  0.121675

So before the apply command, df was properly sorted on the row index. However, as you can see, res is not properly sorted, even though its creation ends with a sort_index command.

This is a bug, right? I would think we want people to be able to assume that anytime they call sort_index that the result comes out lexicographically sorted, no?

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

this is a dupe of #15622

yes the bug here is that it is constructed in a different order when concat is used (though still lexsorted). .concat is used inside groupby.apply here

@jreback jreback closed this as completed Mar 14, 2017
@jreback jreback added Duplicate Report Duplicate issue or pull request Groupby MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 14, 2017
@jreback jreback added this to the No action milestone Mar 14, 2017
@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

actually this is not a dupe of #15622, rather this is correct.

@8one6 see below. lexsortedness has to do with the label orderings w.r.t. to the levels (which may or may not be sorted). see if the below makes sense to you

# the result from above
In [7]: res
Out[7]: 
                       near       far
letter size                          
a      big   newz  0.978738  2.240893
             newa  1.764052  0.400157
       small newz  0.950088 -0.151357
             newa  1.867558 -0.977278
b      big   newz  0.144044  1.454274
             newa -0.103219  0.410599
       small newz  0.443863  0.333674
             newa  0.761038  0.121675

# just setting names and a count
In [8]: res.index.names=['letter','size','other']

In [9]: res['count'] = np.arange(len(res))

In [10]: res
Out[10]: 
                        near       far  count
letter size  other                           
a      big   newz   0.978738  2.240893      0
             newa   1.764052  0.400157      1
       small newz   0.950088 -0.151357      2
             newa   1.867558 -0.977278      3
b      big   newz   0.144044  1.454274      4
             newa  -0.103219  0.410599      5
       small newz   0.443863  0.333674      6
             newa   0.761038  0.121675      7

This is a resort based on the 3 columns, we are reconstructing things here (IOW starting new), NOT from the existing levels

In [11]: res.reset_index().sort_values(['letter', 'size', 'other'])
Out[11]: 
  letter   size other      near       far  count
1      a    big  newa  1.764052  0.400157      1
0      a    big  newz  0.978738  2.240893      0
3      a  small  newa  1.867558 -0.977278      3
2      a  small  newz  0.950088 -0.151357      2
5      b    big  newa -0.103219  0.410599      5
4      b    big  newz  0.144044  1.454274      4
7      b  small  newa  0.761038  0.121675      7
6      b  small  newz  0.443863  0.333674      6

# if we set it again, its sorted (we set it this way)
In [12]: res.reset_index().sort_values(['letter', 'size', 'other']).set_index(['letter', 'size', 'other'])
Out[12]: 
                        near       far  count
letter size  other                           
a      big   newa   1.764052  0.400157      1
             newz   0.978738  2.240893      0
       small newa   1.867558 -0.977278      3
             newz   0.950088 -0.151357      2
b      big   newa  -0.103219  0.410599      5
             newz   0.144044  1.454274      4
       small newa   0.761038  0.121675      7
             newz   0.443863  0.333674      6

# this IS lexsorted
In [13]: res.reset_index().sort_values(['letter', 'size', 'other']).set_index(['letter', 'size', 'other']).index.is_lexsorted()
Out[13]: True

In [14]: res.reset_index().sort_values(['letter', 'size', 'other']).set_index(['letter', 'size', 'other']).index
Out[14]: 
MultiIndex(levels=[['a', 'b'], ['big', 'small'], ['newa', 'newz']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['letter', 'size', 'other'])

# as is the original (but notice that the levels themselves are ordered differently).
In [15]: res.index.is_lexsorted()
Out[15]: True

In [16]: res.index
Out[16]: 
MultiIndex(levels=[['a', 'b'], ['big', 'small'], ['newz', 'newa']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['letter', 'size', 'other'])

I agree this is a tricky concept and we don't reorder level values when sorting.

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

this is also related to this: #14672 (or maybe that will make more sense)

@8one6
Copy link
Author

8one6 commented Mar 14, 2017

First off, thanks for taking the timer to answer this.

But can you help me get my head around this: Are you saying this behavior is correct as above (i.e. that the goal of sort_index is to sort things in the order specified in the internals of the MultiIndex), or are you agreeing that the example above demonstrates a bug (and that sort_index should always return something that is sorted in the traditional use of the term...that is alphabetically)?

If the former, let me raise my hand to say "isn't this going to confuse a ton of people"? And if the former, is there another method which would achieve the simple end of "sorting the index" in the usual way? And if the former, can we try to put this in the docs somewhere? If we're going this way, then I think the intended behavior of sort_index is likely to get users into a lot of trouble with unexpected behaviors.

If the latter, what, if anything, can be done to fix the current behavior?

Thanks!

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

well its currently the former. and yes its confusing. I believe its this ways because this can have a detrimental effect on performance (though its only if you are actually sorting that this matters).

I will reopen this because I think I can fix this and actually make it sort w/o regard to the lexsortedness.

@jreback jreback reopened this Mar 14, 2017
@jreback jreback modified the milestones: 0.20.0, No action Mar 14, 2017
@jreback jreback removed the Duplicate Report Duplicate issue or pull request label Mar 14, 2017
@8one6
Copy link
Author

8one6 commented Mar 14, 2017

I think I can (if I twist my head around funny) understand a context where a person might want to take advantage of the existing behavior. For example it could be a bit like an ordered categorical, where I just decide that for me the fruits ['grape', 'apple', 'cantaloupe'] should be considered in that order within their level (maybe i'm thinking about ordering fruits based on size). And then maybe, just maybe, when I sort a MultiIndex that contains those as a level, that I want them to come out in this very not alphabetical ordering.

But I'd stress that a) I can't imagine that's the more common use case and b) as of now I think the behavior is totally undocumented. I don't fully understand how the guts work here...but if I can help by proofreading new documentation or testing new functions, I'd love to.

Coming at this from my use cases, I would think that it is incredibly rare to see people deliberately creating ordered levels on a MultiIndex in any context other than an explicit call to the constructor. So for me, I'd love to see sort_index always do a "natural", "alphabetical" sort (leaving it to others to decide on how to compare across types). And perhaps there could be a sort_index_bespoke or something, or a kwarg for sort_index itself that would preserve the current behavior for those (uncommon?) cases where it is really desired.

I've found myself using the hack of

df.loc[sorted(df.index), :]

but that's just ugly, right?

@jreback jreback closed this as completed in f478e4f Apr 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants