Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: support kind and na_position kwargs in Series.sort_index #14445

Merged

Conversation

brandonmburroughs
Copy link
Contributor

@brandonmburroughs brandonmburroughs commented Oct 17, 2016

closes #13589
closes #13729

I'm waiting on adding the whatsnew entry. I'm not sure if this is a BUG (the documentation says these parameters exist) or an ENH (we're adding support for these parameters). It depends on how you look at it!

@codecov-io
Copy link

codecov-io commented Oct 18, 2016

Current coverage is 85.20% (diff: 100%)

No coverage report found for master at 3d73d05.

Powered by Codecov. Last update 3d73d05...b013103

@@ -144,3 +144,27 @@ def test_sort_index_multiindex(self):
# rows share same level='A': sort has no effect without remaining lvls
res = s.sort_index(level='A', sort_remaining=False)
assert_series_equal(s, res)

def test_sort_index_kind(self):
series = Series(index=[3, 2, 1, 4, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment with the github reference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

ascending=ascending)
from pandas.core.groupby import _nargsort

if ((ascending and index.is_monotonic_increasing) or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do this optimization inside _nargsort. you can try and see if anything breaks the test suite.

this would just return an indexer like np.arange(len(index)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. There isn't an explicit function to check monotonicity, but I've added similar functionality in _nargsort. This also means we can remove this checking from the similar sort_index function in pandas/core/frame.py.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls remove this part

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now removed. I'll work on moving that functionality into _nargsort.

@jreback
Copy link
Contributor

jreback commented Oct 19, 2016

pls add a whatsnew note, 0.19.1 is fine. (you can list both issues)

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 19, 2016
@jorisvandenbossche jorisvandenbossche added this to the 0.19.1 milestone Oct 20, 2016
@@ -48,3 +48,4 @@ Bug Fixes
- Bug in ``MultiIndex.set_levels`` where illegal level values were still set after raising an error (:issue:`13754`)
- Bug in ``DataFrame.to_json`` where ``lines=True`` and a value contained a ``}`` character (:issue:`14391`)
- Bug in ``df.groupby`` causing an ``AttributeError`` when grouping a single index frame by a column and the index level (:issue`14327`)
- Bug in ``Series.sort_index`` where parameters ``kind`` and ``na_position`` did not exist (:issue:`13589` & :issue:`14444`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, -> &

@@ -4280,6 +4280,11 @@ def _nargsort(items, kind='quicksort', ascending=True, na_position='last'):
if is_categorical_dtype(items):
return items.argsort(ascending=ascending)

# # GH11080 - Check monotonic-ness before sort an index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this optimization is very problematic. Its quite inefficient (we already have these checks on an actual Index, but we don't guarantee this is an index here, though easy to wrap). It doesn't handle nans properly. If it passes the test suite I guess its ok, but would need a performance review. Remove and put in a new PR please.

@@ -144,3 +144,28 @@ def test_sort_index_multiindex(self):
# rows share same level='A': sort has no effect without remaining lvls
res = s.sort_index(level='A', sort_remaining=False)
assert_series_equal(s, res)

# GH #14444 & #13589: Add support for sort algo choosing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the comment inside the function

@jorisvandenbossche
Copy link
Member

I think this only handles single Index for now, we should also deal with MultiIndex (see also the older PR for this https://github.com/pandas-dev/pandas/pull/13729/files)

@jreback
Copy link
Contributor

jreback commented Oct 20, 2016

@jorisvandenbossche that looks right. on merge let's make sure to close the other PR

@brandonmburroughs brandonmburroughs force-pushed the series_sort_index branch 2 times, most recently from 7a65e84 to 5cade34 Compare October 20, 2016 23:50
@@ -37,6 +37,7 @@ Bug Fixes
- Bug in localizing an ambiguous timezone when a boolean is passed (:issue:`14402`)


- Bug in ``Series.sort_index`` where parameters ``kind`` and ``na_position`` did not exist (:issue:`13589`, :issue:`14444`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche want to highlite this change. maybe enhancements section? (even though its a bug fix / api change type of fix)

@@ -1786,8 +1786,11 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
indexer = _ensure_platform_int(indexer)
new_index = index.take(indexer)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put:

indexer = _ensure_platform_int(indexer)
new_index = index.take(indexer)

outside the if-else block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; moved.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.1 Nov 1, 2016
@@ -30,6 +30,7 @@ Performance Improvements

Bug Fixes
~~~~~~~~~
- Bug in ``Series.sort_index`` where parameters ``kind`` and ``na_position`` did not exist (:issue:`13589`, :issue:`14444`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to 0.20.0

@jreback
Copy link
Contributor

jreback commented Nov 16, 2016

ping on green.

@jorisvandenbossche
Copy link
Member

For me it's good to have this merged, but note that it does not fully close the issue, as it's not yet solved for MultiIndex, so I would leave the issue open and clarify there.
The DataFrame.sort_index can handle na_position for MultiIndex, so I suspect it should not be difficult to add this to Series.sort_index as well.

@brandonmburroughs
Copy link
Contributor Author

@jorisvandenbossche I overlooked that; thanks for pointing it out. I've now added na_position to MultiIndex as well. Thanks!

@jreback Moved to other enhancements in 0.20.0.

@brandonmburroughs brandonmburroughs force-pushed the series_sort_index branch 2 times, most recently from 8a6bb77 to b013103 Compare November 20, 2016 23:02
@jorisvandenbossche
Copy link
Member

@brandonmburroughs thanks for the update. Last question: can you also add a test for 'na_position' for the multi-index case

@brandonmburroughs
Copy link
Contributor Author

@jorisvandenbossche As I was adding the test, I realized that this doesn't seem to really use the na_position argument due to the way we sort the MultiIndex. Whenever we create a MultiIndex, we store the labels as relative values. For instance, if we have the following MultiIndex:

MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))

the values get stored as

MultiIndex(levels=[[1], [1], [3]],
           labels=[[0, 0], [0, 0], [0, -1]],
           names=[u'A', u'B', u'C'])

These label values are what get passed to the sorting algorithm for both DataFrames and Series. Since the sorting only happens on the labels, it has no notion of the NaN. Hence, even with the updated code, we get the following results for a Series:

In [206]: mi = MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))

In [207]: s = Series([1, 2], mi)

In [208]: s.sort_index(na_position="last")
Out[208]: 
A  B  C  
1  1  NaN    2
      3      1
dtype: int64

In [209]: s.sort_index(na_position="first")
Out[209]: 
A  B  C  
1  1  NaN    2
      3      1
dtype: int64

As I did more digging, I realized the same thing happens for DataFrame and it would seem the na_position argument never really worked.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: u'0.19.0'

In [4]: mi = pd.MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))

In [5]: df = pd.DataFrame([[1, 2], [3, 4]], mi)

In [6]: df.sort_index(na_position="first")
Out[6]: 
         0  1
A B C        
1 1 NaN  3  4
    3    1  2

In [7]: df.sort_index(na_position="last")
Out[7]: 
         0  1
A B C        
1 1 NaN  3  4
    3    1  2

I've done a bit of looking and I think it can be fixed by passing the values of the MultiIndex to sort rather than the labels. Should I

  1. Add the changes here to be reviewed?
  2. Create a new pull request since it's kind of a separate problem?
  3. Create an issue detailing this behavior and then create a pull request?

@jreback
Copy link
Contributor

jreback commented Nov 28, 2016

this has not been implemented afair for multi index; it's straightforward to do; you can create an issue about it

you will always sort labels; however (-1) is the nan indicator

@jorisvandenbossche
Copy link
Member

@brandonmburroughs Let's just leave MultiIndex out of this PR then (so it can be merged), it indeed does not work for DataFrames as well (maybe you can add to the docstring of this keyword something like "Not implemented for MultiIndex").

I think it can be fixed by passing the values of the MultiIndex to sort rather than the labels

I would first open an issue about it (or look for an existing one, I would think we already have an issue about it) to discuss it, as this is not a simple change (maybe in code changes, but has implications for usage)

@jorisvandenbossche
Copy link
Member

Regarding the MutliIndex sort, I don't directly find an issue about it explicitly, but see the discussion in #14015, and an issue that followed from that discussion #14672.

@jreback
Copy link
Contributor

jreback commented Nov 28, 2016

all of @jorisvandenbossche comments are good

multi index sorting is not well implemented except for lexsorting
let's open a new issue (maybe should instead be a master issue encompassing the other 2 mi sort issues)

@brandonmburroughs
Copy link
Contributor Author

Okay, thanks for all of the feedback @jreback and @jorisvandenbossche . I created a new issue for na_position specifically, #14784 . I also added a note in the sort_index docstring that na_position isn't implemented for MultiIndex.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a small remaining comment

indexer = _ensure_platform_int(indexer)
new_index = index.take(indexer)
indexer = _lexsort_indexer(index.labels, orders=ascending,
na_position=na_position)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you leave this line out? (if it's not supported, not really needed to pass it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, leave out the na_position argument from _lexsort_indexer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes (as it has no effect I understood?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is true. Unfortunately, it has no effect right now. I pushed that change now.

@jorisvandenbossche jorisvandenbossche merged commit 5d0e157 into pandas-dev:master Dec 4, 2016
@jorisvandenbossche jorisvandenbossche changed the title Updating sort_index for Series to follow pattern DataFrame ENH: support kind and na_position kwargs in Series.sort_index Dec 4, 2016
@jorisvandenbossche
Copy link
Member

@brandonmburroughs Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series.sort_index does not accept kwargs kind and na_position.
4 participants