sort_index fails on MutiIndex'ed DataFrame resulting from groupby.apply #15687

8one6 · 2017-03-14T18:04:56Z

I feel like this is part of a suite of bugs that come from a failure to notice when a MultiIndex that was once lexsorted loses its lexsortedness. I submitted one example of this a couple years ago (see #8017), but this related bug persists.

On Pandas 0.19.0:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(
    np.random.randn(8, 2),
    index=pd.MultiIndex.from_product([['a', 'b'], ['big', 'small'], ['red', 'blue']], names=['letter', 'size', 'color']),
    columns=['near', 'far']
)
df = df.sort_index()

def my_func(group):
    group.index = ['newz', 'newa']
    return group

res = df.groupby(level=['letter', 'size']).apply(my_func).sort_index()

print res
###OUTPUT###
                       near       far
letter size                          
a      big   newz  0.978738  2.240893
             newa  1.764052  0.400157
       small newz  0.950088 -0.151357
             newa  1.867558 -0.977278
b      big   newz  0.144044  1.454274
             newa -0.103219  0.410599
       small newz  0.443863  0.333674
             newa  0.761038  0.121675

So before the apply command, df was properly sorted on the row index. However, as you can see, res is not properly sorted, even though its creation ends with a sort_index command.

This is a bug, right? I would think we want people to be able to assume that anytime they call sort_index that the result comes out lexicographically sorted, no?

The text was updated successfully, but these errors were encountered:

jreback · 2017-03-14T18:17:55Z

this is a dupe of #15622

yes the bug here is that it is constructed in a different order when concat is used (though still lexsorted). .concat is used inside groupby.apply here

jreback · 2017-03-14T23:13:34Z

actually this is not a dupe of #15622, rather this is correct.

@8one6 see below. lexsortedness has to do with the label orderings w.r.t. to the levels (which may or may not be sorted). see if the below makes sense to you

# the result from above
In [7]: res
Out[7]: 
                       near       far
letter size                          
a      big   newz  0.978738  2.240893
             newa  1.764052  0.400157
       small newz  0.950088 -0.151357
             newa  1.867558 -0.977278
b      big   newz  0.144044  1.454274
             newa -0.103219  0.410599
       small newz  0.443863  0.333674
             newa  0.761038  0.121675

# just setting names and a count
In [8]: res.index.names=['letter','size','other']

In [9]: res['count'] = np.arange(len(res))

In [10]: res
Out[10]: 
                        near       far  count
letter size  other                           
a      big   newz   0.978738  2.240893      0
             newa   1.764052  0.400157      1
       small newz   0.950088 -0.151357      2
             newa   1.867558 -0.977278      3
b      big   newz   0.144044  1.454274      4
             newa  -0.103219  0.410599      5
       small newz   0.443863  0.333674      6
             newa   0.761038  0.121675      7

This is a resort based on the 3 columns, we are reconstructing things here (IOW starting new), NOT from the existing levels

In [11]: res.reset_index().sort_values(['letter', 'size', 'other'])
Out[11]: 
  letter   size other      near       far  count
1      a    big  newa  1.764052  0.400157      1
0      a    big  newz  0.978738  2.240893      0
3      a  small  newa  1.867558 -0.977278      3
2      a  small  newz  0.950088 -0.151357      2
5      b    big  newa -0.103219  0.410599      5
4      b    big  newz  0.144044  1.454274      4
7      b  small  newa  0.761038  0.121675      7
6      b  small  newz  0.443863  0.333674      6

# if we set it again, its sorted (we set it this way)
In [12]: res.reset_index().sort_values(['letter', 'size', 'other']).set_index(['letter', 'size', 'other'])
Out[12]: 
                        near       far  count
letter size  other                           
a      big   newa   1.764052  0.400157      1
             newz   0.978738  2.240893      0
       small newa   1.867558 -0.977278      3
             newz   0.950088 -0.151357      2
b      big   newa  -0.103219  0.410599      5
             newz   0.144044  1.454274      4
       small newa   0.761038  0.121675      7
             newz   0.443863  0.333674      6

# this IS lexsorted
In [13]: res.reset_index().sort_values(['letter', 'size', 'other']).set_index(['letter', 'size', 'other']).index.is_lexsorted()
Out[13]: True

In [14]: res.reset_index().sort_values(['letter', 'size', 'other']).set_index(['letter', 'size', 'other']).index
Out[14]: 
MultiIndex(levels=[['a', 'b'], ['big', 'small'], ['newa', 'newz']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['letter', 'size', 'other'])

# as is the original (but notice that the levels themselves are ordered differently).
In [15]: res.index.is_lexsorted()
Out[15]: True

In [16]: res.index
Out[16]: 
MultiIndex(levels=[['a', 'b'], ['big', 'small'], ['newz', 'newa']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['letter', 'size', 'other'])

I agree this is a tricky concept and we don't reorder level values when sorting.

jreback · 2017-03-14T23:15:20Z

this is also related to this: #14672 (or maybe that will make more sense)

8one6 · 2017-03-14T23:30:11Z

First off, thanks for taking the timer to answer this.

But can you help me get my head around this: Are you saying this behavior is correct as above (i.e. that the goal of sort_index is to sort things in the order specified in the internals of the MultiIndex), or are you agreeing that the example above demonstrates a bug (and that sort_index should always return something that is sorted in the traditional use of the term...that is alphabetically)?

If the former, let me raise my hand to say "isn't this going to confuse a ton of people"? And if the former, is there another method which would achieve the simple end of "sorting the index" in the usual way? And if the former, can we try to put this in the docs somewhere? If we're going this way, then I think the intended behavior of sort_index is likely to get users into a lot of trouble with unexpected behaviors.

If the latter, what, if anything, can be done to fix the current behavior?

Thanks!

jreback · 2017-03-14T23:34:17Z

well its currently the former. and yes its confusing. I believe its this ways because this can have a detrimental effect on performance (though its only if you are actually sorting that this matters).

I will reopen this because I think I can fix this and actually make it sort w/o regard to the lexsortedness.

8one6 · 2017-03-14T23:41:10Z

I think I can (if I twist my head around funny) understand a context where a person might want to take advantage of the existing behavior. For example it could be a bit like an ordered categorical, where I just decide that for me the fruits ['grape', 'apple', 'cantaloupe'] should be considered in that order within their level (maybe i'm thinking about ordering fruits based on size). And then maybe, just maybe, when I sort a MultiIndex that contains those as a level, that I want them to come out in this very not alphabetical ordering.

But I'd stress that a) I can't imagine that's the more common use case and b) as of now I think the behavior is totally undocumented. I don't fully understand how the guts work here...but if I can help by proofreading new documentation or testing new functions, I'd love to.

Coming at this from my use cases, I would think that it is incredibly rare to see people deliberately creating ordered levels on a MultiIndex in any context other than an explicit call to the constructor. So for me, I'd love to see sort_index always do a "natural", "alphabetical" sort (leaving it to others to decide on how to compare across types). And perhaps there could be a sort_index_bespoke or something, or a kwarg for sort_index itself that would preserve the current behavior for those (uncommon?) cases where it is really desired.

I've found myself using the hack of

df.loc[sorted(df.index), :]

but that's just ugly, right?

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 xref pandas-dev#13431

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback closed this as completed Mar 14, 2017

jreback added Duplicate Report Duplicate issue or pull request Groupby MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 14, 2017

jreback added this to the No action milestone Mar 14, 2017

jreback reopened this Mar 14, 2017

jreback modified the milestones: 0.20.0, No action Mar 14, 2017

jreback removed the Duplicate Report Duplicate issue or pull request label Mar 14, 2017

jreback added a commit to jreback/pandas that referenced this issue Mar 15, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

5155ebb

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015

jreback mentioned this issue Mar 15, 2017

BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels #15694

Closed

jreback added a commit to jreback/pandas that referenced this issue Mar 15, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

e7c0c14

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 xref pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 15, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

72bc7d0

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 xref pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

698e05f

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

a6f352c

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

54c6e93

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

1a9be09

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

685dadc

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

efaf233

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 19, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

dd85330

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

933710c

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

53c7e6c

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

083b4b0

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 23, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

7784ec0

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 25, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

09f842d

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 2, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

ae3777e

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 4, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

84c8999

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 4, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

53b844d

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 6, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

a1390ce

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

47c67d6

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback closed this as completed in f478e4f Apr 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sort_index fails on MutiIndex'ed DataFrame resulting from groupby.apply #15687

sort_index fails on MutiIndex'ed DataFrame resulting from groupby.apply #15687

8one6 commented Mar 14, 2017

jreback commented Mar 14, 2017

jreback commented Mar 14, 2017

jreback commented Mar 14, 2017

8one6 commented Mar 14, 2017

jreback commented Mar 14, 2017

8one6 commented Mar 14, 2017 •

edited

Loading

sort_index fails on MutiIndex'ed DataFrame resulting from groupby.apply #15687

sort_index fails on MutiIndex'ed DataFrame resulting from groupby.apply #15687

Comments

8one6 commented Mar 14, 2017

jreback commented Mar 14, 2017

jreback commented Mar 14, 2017

jreback commented Mar 14, 2017

8one6 commented Mar 14, 2017

jreback commented Mar 14, 2017

8one6 commented Mar 14, 2017 • edited Loading

8one6 commented Mar 14, 2017 •

edited

Loading