
REF/POC: Share groupby/series algos (rank) #38744

Merged
merged 20 commits into pandas-dev:master from mzeitlin11:ref/rank on Jan 1, 2021

Conversation

@mzeitlin11 (Member) commented Dec 28, 2020

  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Right now there are 3 rank algorithms which are extremely similar but have slight differences, and this causes maintenance pain. First, inconsistencies between the implementations leave more room for bugs (like #32593, which is only a bug for rank_2d). Second, for issues like #32859 which affect all 3, a fix has to be applied in 3 places and tests added separately for Series, DataFrame, and GroupBy. This PR attempts to mitigate these issues by combining the group_rank and rank_1d implementations (which, as an additional bonus, gives the enhancement of object support for GroupBy.rank() (#38278)).

Is this kind of refactor/deduplication helpful? If so, similar logic can probably be applied elsewhere.

The diff here makes the changes look more complicated than they are because rank_1d is essentially replaced by group_rank. The original group_rank implementation is only slightly changed to allow for optional labels and object support.
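To make the idea concrete, here is a minimal sketch of the shared-kernel approach in plain Python/NumPy (hypothetical names and ordinal-only tie handling, not the actual Cython implementation): the Series path is just the grouped path with all-zero labels, so a single group covers the whole array.

```python
import numpy as np

def shared_rank(values, labels=None):
    """Hypothetical sketch: a Series rank is a grouped rank where every
    element carries label 0, so one kernel serves both paths.
    Ties are broken in sort order (ordinal), unlike the real methods."""
    if labels is None:
        labels = np.zeros(len(values), dtype=np.intp)  # one big group
    # Sort by group first, then by value within each group
    order = np.lexsort((values, labels))
    out = np.empty(len(values), dtype=np.float64)
    rank, prev_label = 0, None
    for pos in order:
        if labels[pos] != prev_label:
            rank = 0  # restart ranks at each group boundary
            prev_label = labels[pos]
        rank += 1
        out[pos] = rank
    return out

x = np.array([3.0, 1.0, 2.0])
print(shared_rank(x))                              # [3. 1. 2.]
print(shared_rank(x, labels=np.array([0, 1, 1])))  # [1. 1. 2.]
```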

Benchmarks look ok:

      before           after         ratio
     [9f1a41de]       [5cd81d25]
     <master>         <ref/rank>
       7.87±0.1ms       7.77±0.1ms     0.99  frame_methods.Rank.time_rank('float')
       2.61±0.1ms       2.55±0.1ms     0.97  frame_methods.Rank.time_rank('int')
         57.9±3ms         59.0±6ms     1.02  frame_methods.Rank.time_rank('object')
      2.67±0.04ms      2.58±0.08ms     0.97  frame_methods.Rank.time_rank('uint')
      
       10.8±0.4ms       9.45±0.3ms    ~0.87  series_methods.Rank.time_rank('float')
       7.39±0.2ms         6.79±1ms     0.92  series_methods.Rank.time_rank('int')
         52.9±4ms         47.7±2ms    ~0.90  series_methods.Rank.time_rank('object')
         7.32±1ms       6.58±0.3ms    ~0.90  series_methods.Rank.time_rank('uint')
         
         
          314±3μs         352±40μs    ~1.12  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'direct')
          316±8μs         322±10μs     1.02  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'transformation')
         421±10μs         475±60μs    ~1.13  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'direct')
         409±10μs         482±60μs    ~1.18  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'transformation')
          505±3μs         420±10μs    ~0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'direct')
-        510±20μs          410±3μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'transformation')
         411±20μs         441±60μs     1.07  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'direct')
         486±30μs         509±10μs     1.05  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'transformation')
         470±50μs          424±9μs    ~0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct')
         409±10μs         505±60μs    ~1.23  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'transformation')
          404±7μs         470±60μs    ~1.16  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'direct')
          407±7μs          412±9μs     1.01  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'transformation')

# decrement that from their position. fill in the size of each
# group encountered (used by pct calculations later). also be
# sure to reset any of the items helping to calculate dups
if at_end or (check_labels and (labels_[_as[i]] != labels_[_as[i+1]])):
Member Author:

The check_labels condition is not necessary for correctness, since labels_ is zeroed if no labels are passed, but short-circuiting to avoid the unnecessary comparison labels_[_as[i]] != labels_[_as[i+1]] seemed to give a small perf boost.
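A rough Python illustration of the effect (a hypothetical stand-in for the Cython loop): when labels is all zeros, np.any(labels) is False and the `and` short-circuits past the two per-iteration array lookups.

```python
import numpy as np

def count_group_ends(sort_idx, labels):
    """Hypothetical stand-in for the dup-resetting loop: check_labels is
    False for the ungrouped case, so the label comparison is skipped."""
    check_labels = np.any(labels)
    n, ends = len(sort_idx), 0
    for i in range(n):
        at_end = i == n - 1
        # With check_labels False, the two lookups below never run
        if at_end or (check_labels and labels[sort_idx[i]] != labels[sort_idx[i + 1]]):
            ends += 1
    return ends

labels = np.zeros(100_000, dtype=np.intp)  # rank_1d case: no real groups
print(count_group_ends(np.arange(100_000), labels))  # 1: only at_end fires
```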

@@ -444,6 +444,7 @@ def test_rank_avg_even_vals():
tm.assert_frame_equal(result, exp_df)


@pytest.mark.xfail(reason="Works now, needs tests")
Member Author:

To keep this PR focused on the approach to deduplication, I plan to leave this for a followup (basically adding a whatsnew note and tests, which will close #38278).

@arw2019 (Member) left a comment:

This looks cool!

@jreback (Contributor) left a comment:

To make the diff easier on the eyes, can you do a precursor PR which just straight up moves the things you are going to need?

@jreback (Contributor) commented Dec 28, 2020

> Is this kind of refactor/deduplication helpful? If so, similar logic can probably be applied elsewhere.

yes, absolutely. I am not sure if we have an issue about this, but we would certainly take things like this. Note these might be quite tricky because there are possibilities of subtle differences between the algos, and performance is always a concern (e.g. the 1-d algos with no grouping are often faster than an equivalent groupby with a single group), but the code unification is more important.

@mzeitlin11 (Member, Author) commented Dec 28, 2020

> To make the diff easier on the eyes, can you do a precursor PR which just straight up moves the things you are going to need?

Was worried this would be an issue, but I'm not sure how to clean it up. The diff is so bad because rank_1d is being replaced with group_rank, so the diff essentially compares the original rank_1d to the original group_rank (this single replacement alone makes the diff useless). The helpful diff here would be the new rank_1d vs the old group_rank. I can't think of anything that could be moved as a precursor without breaking existing functionality (other than orthogonal additions like the benchmarks); any advice would be much appreciated if you had something specific in mind.

@jreback (Contributor) commented Dec 28, 2020

> To make the diff easier on the eyes, can you do a precursor PR which just straight up moves the things you are going to need?
>
> Was worried this would be an issue, but I'm not sure how to clean it up. The diff is so bad because rank_1d is being replaced with group_rank, so the diff essentially compares the original rank_1d to the original group_rank (this single replacement alone makes the diff useless). The helpful diff here would be the new rank_1d vs the old group_rank. I can't think of anything that could be moved as a precursor without breaking existing functionality (other than orthogonal additions like the benchmarks); any advice would be much appreciated if you had something specific in mind.

ok, it's fine if a precursor is not helpful (e.g. just do this one)

@jreback (Contributor) commented Dec 29, 2020

can you show the asv's vs master

@mzeitlin11 (Member, Author):

> can you show the asv's vs master

They're in a details block in the PR body.

@jreback (Contributor) commented Dec 30, 2020

can you merge master

@mzeitlin11 (Member, Author):

Merged master

    values = np.asarray(in_arr).copy()
elif rank_t is object:
    values = np.array(in_arr, copy=True)
N = len(in_arr)
Contributor:

can you assert len(labels) == N

Member Author:

Done
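The requested guard amounts to checking that each value has exactly one label; a trivial sketch of the shape check (hypothetical wrapper, not the actual Cython):

```python
import numpy as np

def validate_rank_inputs(values, labels):
    # Guard from the review: labels must align one-to-one with values
    N = len(values)
    assert len(labels) == N, "labels must be the same length as values"

validate_rank_inputs(np.array([1.0, 2.0]), np.array([0, 0]))  # passes
```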

# Copy values into new array in order to fill missing data
# with mask, without obfuscating location of missing data
# in values array
masked_vals = np.array(in_arr, copy=True)
Contributor:

in_arr.copy() ?

Member Author:

Done
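For context on the suggestion: both spellings make an independent copy; `.copy()` is simply the more direct idiom.

```python
import numpy as np

in_arr = np.array([1.0, np.nan, 3.0])

a = np.array(in_arr, copy=True)  # constructor-style copy
b = in_arr.copy()                # equivalent and more idiomatic

a[0], b[0] = -1.0, -2.0          # mutate the copies
print(in_arr[0])                 # 1.0 -- the original is untouched
```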

# in values array
masked_vals = np.array(in_arr, copy=True)
if rank_t is object and masked_vals.dtype != np.object_:
    masked_vals = masked_vals.astype('O')
Contributor:

you can just do this for object types (e.g. do an else and then)
masked_vals = in_arr.copy(), since .astype always copies

Member Author:

Thanks
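Roughly, the simplification being suggested, as a Python sketch (assuming rank_t reduces to an object/non-object branch; ndarray.astype returns a new array by default, so no separate copy is needed on the object path):

```python
import numpy as np

def make_masked_vals(in_arr, is_object):
    """Sketch of the restructured branch: exactly one copy either way."""
    if is_object:
        return in_arr.astype('O')  # converts and copies in a single step
    return in_arr.copy()

vals = np.array([2.0, np.nan, 1.0])
masked = make_masked_vals(vals, is_object=False)
masked[0] = 0.0
print(vals[0])  # 2.0 -- still an independent copy
```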

@jreback (Contributor) commented Dec 31, 2020

cc @jbrockmendel

for i in range(N):
    # We don't include NaN values in percentage
    # rankings, so we assign them percentages of NaN.
    if out[i] != out[i] or out[i] == NaN:
Member:

does out[i] == NaN ever hold?

Member Author:

Hmm, it is a strange condition; not sure why it was there in the original code. Just removing that entire if statement doesn't seem to change the logic (I think it was just saying "if NaN, then set as NaN").
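For reference, the reason the second half of the condition is dead code: IEEE 754 NaN compares unequal to everything, including itself, so x != x is the standard NaN test and x == NaN can never be True.

```python
import numpy as np

x = np.nan
print(x != x)       # True  -- the usual self-inequality NaN check
print(x == np.nan)  # False -- NaN never compares equal, even to NaN
```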

@jreback (Contributor) left a comment:

lgtm, a few nits

# each label corresponds to a different group value,
# the mask helps you differentiate missing values before
# performing sort on the actual values
_as = np.lexsort(order).astype(np.int64, copy=False)
Contributor:

can you rename _as to something more readable, maybe lexsort_indexer

Member Author:

Done
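For illustration, the kind of ordering np.lexsort produces from these keys (toy arrays; the real keys are the masked values, the NaN mask, and the group labels, with the last key sorted first):

```python
import numpy as np

values = np.array([3.0, np.nan, 2.0, 1.0])
mask = np.isnan(values)            # separates missing values
labels = np.array([1, 0, 0, 1])    # group membership

# np.lexsort treats the LAST key as primary: labels, then the mask,
# then the values -- each group comes out contiguous, with its missing
# values segregated before the value sort applies.
lexsort_indexer = np.lexsort((values, mask, labels))
print(lexsort_indexer)  # [2 1 3 0]: group 0 -> (2.0, NaN), group 1 -> (1.0, 3.0)
```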

@jreback added this to the 1.3 milestone Dec 31, 2020
@jreback (Contributor) left a comment:

small comment, can catch in a followup, thanks @mzeitlin11

grp_sizes = np.ones(N)
# If all 0 labels, can short-circuit later label
# comparisons
check_labels = np.any(labels)
Contributor:

in your follow-on, can you put a blank line before comments? easier to read


if rank_t is object:
    _as = np.lexsort(keys=order)
if ascending ^ (na_option == 'top'):
Contributor:

ideally we add a comment here on what this is doing
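One hedged reading of the condition (my interpretation, not a comment from the PR): the XOR is True exactly when missing values should behave as the largest values in the sort key, so they land at the end of each sorted group.

```python
# Truth table for the condition, under that interpretation
for ascending in (True, False):
    for na_option in ("top", "bottom"):
        nans_sort_last = ascending ^ (na_option == "top")
        print(f"ascending={ascending}, na_option={na_option!r} "
              f"-> NaNs act as largest: {nans_sort_last}")
# ascending=True,  'bottom' -> True   (NaNs take the highest ranks)
# ascending=True,  'top'    -> False  (NaNs take the lowest ranks)
# ascending=False, 'top'    -> True;  ascending=False, 'bottom' -> False
```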

@jreback merged commit 2b4bcf2 into pandas-dev:master on Jan 1, 2021
@mzeitlin11 deleted the ref/rank branch on Jan 1, 2021 at 00:21
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request on Jan 20, 2021
@mzeitlin11 mentioned this pull request on Mar 21, 2021