Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow keep='all' for nlargest/nsmallest #21650

Merged
merged 1 commit into from
Jun 28, 2018

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jun 27, 2018

Title is self-explanatory.

Closes #16818.
Closes #18656.

@gfyoung gfyoung added Enhancement API Design Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jun 27, 2018
@gfyoung gfyoung added this to the 0.24.0 milestone Jun 27, 2018
@gfyoung gfyoung requested a review from jreback June 27, 2018 07:16
@codecov
Copy link

codecov bot commented Jun 27, 2018

Codecov Report

Merging #21650 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21650      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         154      154              
  Lines       49555    49557       +2     
==========================================
+ Hits        45542    45544       +2     
  Misses       4013     4013
Flag Coverage Δ
#multiple 90.27% <100%> (ø) ⬆️
#single 42.03% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 97.19% <ø> (ø) ⬆️
pandas/core/algorithms.py 94.85% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0f978d...4998050. Read the comment docs.

s = Series([10, 9, 8, 7, 7, 7, 7, 6])
result = s.nlargest(4, keep='all')
expected = Series([10, 9, 8, 7, 7, 7, 7])
print(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print statement should be removed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -24,6 +24,7 @@ Other Enhancements
<https://pandas-gbq.readthedocs.io/en/latest/changelog.html#changelog-0-5-0>`__.
(:issue:`21627`)
- New method :meth:`HDFStore.walk` will recursively walk the group hierarchy of an HDF5 file (:issue:`10932`)
- :func:`Series` / :func:`DataFrame` methods :func:`nlargest` / :func:`nsmallest` now accept the value 'all' for the `keep` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these will link properly if the class is separated from the method (I could be wrong though), and I think you'll need to explicitly write out the variations, e.g. :meth:`Series.nlargest`.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 27, 2018

@jschendel : Thanks for comments! TBH, I didn't really check the PR diff before trying to apply it on master. 😄 I'll get to these shortly.

@gfyoung gfyoung force-pushed the nlargest-keep-last branch from 8d3b883 to 4b7f199 Compare June 27, 2018 16:47
@gfyoung
Copy link
Member Author

gfyoung commented Jun 27, 2018

@jreback @jschendel : Comments addressed, and all is passing. PTAL.

@jschendel
Copy link
Member

jschendel commented Jun 27, 2018

Can you add 'all' to the list of params for the Series and DataFrame asv's related to NSort:

class NSort(object):
goal_time = 0.2
params = ['last', 'first']

class NSort(object):
goal_time = 0.2
params = ['first', 'last']

I don't think this change will have any perf impact, so no need to run the asv's, but might be nice to have benchmarks in place for this.

LGTM otherwise.

@gfyoung gfyoung force-pushed the nlargest-keep-last branch from 4b7f199 to a07c686 Compare June 28, 2018 00:15
@gfyoung
Copy link
Member Author

gfyoung commented Jun 28, 2018

@jschendel : Yep, done.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. very minor comments. merge on green.

4 8 e 4.0

When using ``keep='all'``, all duplicate items are maintained
>>> df.nsmallest(3, 'a', keep='all')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, blank line here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

``TypeError``:

>>> df.nsmallest(3, 'b')
Traceback (most recent call last):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -2461,6 +2461,22 @@ def test_n_duplicate_index(self, df_duplicates, n, order):
expected = df.sort_values(order, ascending=False).head(n)
tm.assert_frame_equal(result, expected)

def test_keep_all_ties(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you name: test_duplicate_keep_all_ties

side issue: I think make sense to split out duplicate tests to separate file (can you make an issue / PR if you feel!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -2082,6 +2082,17 @@ def test_boundary_datetimelike(self, nselect_method, dtype):
vals = [min_val + 1, min_val + 2, max_val - 1, max_val, min_val]
assert_check_nselect_boundary(vals, dtype, nselect_method)

def test_keep_all_ties(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same naming here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -24,6 +24,7 @@ Other Enhancements
<https://pandas-gbq.readthedocs.io/en/latest/changelog.html#changelog-0-5-0>`__.
(:issue:`21627`)
- New method :meth:`HDFStore.walk` will recursively walk the group hierarchy of an HDF5 file (:issue:`10932`)
- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add both issue numbers here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second issue number #18656 is an actually the old PR. I mentioned it for completeness.

@gfyoung gfyoung force-pushed the nlargest-keep-last branch from a07c686 to 4998050 Compare June 28, 2018 17:01
@gfyoung
Copy link
Member Author

gfyoung commented Jun 28, 2018

All is green, so merging.

@gfyoung gfyoung merged commit 0801b8c into pandas-dev:master Jun 28, 2018
@gfyoung gfyoung deleted the nlargest-keep-last branch June 28, 2018 21:20
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API Design Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants