
REF/POC: Share groupby/series algos (rank) #38744

Merged
merged 20 commits into pandas-dev:master from mzeitlin11:ref/rank on Jan 1, 2021

Conversation

@mzeitlin11 (Member) commented Dec 28, 2020

  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Right now there are 3 rank algorithms which are extremely similar but have slight differences, and this causes maintenance pain. First, inconsistencies between the implementations leave more room for bugs (like #32593, which is only a bug for rank_2d). Second, for issues like #32859 which affect all 3, a fix has to be applied in 3 places and tests added separately for Series, DataFrame, and GroupBy. This PR attempts to mitigate these issues by combining the group_rank and rank_1d implementations (which, as an additional bonus, gives the enhancement of object support for GroupBy.rank() (#38278)).

Is this kind of refactor/deduplication helpful? If so, similar logic can probably be applied elsewhere.

The diff here makes the changes look more complicated than they are because rank_1d is essentially replaced by group_rank. The original group_rank implementation is only slightly changed to allow for optional labels and object support.
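To make the idea concrete, here is a minimal sketch of the shared-kernel approach in plain Python/NumPy (hypothetical names and ordinal-only tie handling, not the actual Cython implementation): the Series path is just the grouped path with all-zero labels, so a single group covers the whole array.

```python
import numpy as np

def shared_rank(values, labels=None):
    """Hypothetical sketch: a Series rank is a grouped rank where every
    element carries label 0, so one kernel serves both paths.
    Ties are broken in sort order (ordinal), unlike the real methods."""
    if labels is None:
        labels = np.zeros(len(values), dtype=np.intp)  # one big group
    # Sort by group first, then by value within each group
    order = np.lexsort((values, labels))
    out = np.empty(len(values), dtype=np.float64)
    rank, prev_label = 0, None
    for pos in order:
        if labels[pos] != prev_label:
            rank = 0  # restart ranks at each group boundary
            prev_label = labels[pos]
        rank += 1
        out[pos] = rank
    return out

x = np.array([3.0, 1.0, 2.0])
print(shared_rank(x))                              # [3. 1. 2.]
print(shared_rank(x, labels=np.array([0, 1, 1])))  # [1. 1. 2.]
```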

Benchmarks look ok:

      before           after         ratio
     [9f1a41de]       [5cd81d25]
     <master>         <ref/rank>
       7.87±0.1ms       7.77±0.1ms     0.99  frame_methods.Rank.time_rank('float')
       2.61±0.1ms       2.55±0.1ms     0.97  frame_methods.Rank.time_rank('int')
         57.9±3ms         59.0±6ms     1.02  frame_methods.Rank.time_rank('object')
      2.67±0.04ms      2.58±0.08ms     0.97  frame_methods.Rank.time_rank('uint')
      
       10.8±0.4ms       9.45±0.3ms    ~0.87  series_methods.Rank.time_rank('float')
       7.39±0.2ms         6.79±1ms     0.92  series_methods.Rank.time_rank('int')
         52.9±4ms         47.7±2ms    ~0.90  series_methods.Rank.time_rank('object')
         7.32±1ms       6.58±0.3ms    ~0.90  series_methods.Rank.time_rank('uint')
         
         
          314±3μs         352±40μs    ~1.12  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'direct')
          316±8μs         322±10μs     1.02  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'transformation')
         421±10μs         475±60μs    ~1.13  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'direct')
         409±10μs         482±60μs    ~1.18  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'transformation')
          505±3μs         420±10μs    ~0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'direct')
-        510±20μs          410±3μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'transformation')
         411±20μs         441±60μs     1.07  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'direct')
         486±30μs         509±10μs     1.05  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'transformation')
         470±50μs          424±9μs    ~0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct')
         409±10μs         505±60μs    ~1.23  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'transformation')
          404±7μs         470±60μs    ~1.16  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'direct')
          407±7μs          412±9μs     1.01  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'transformation')

# decrement that from their position. fill in the size of each
# group encountered (used by pct calculations later). also be
# sure to reset any of the items helping to calculate dups
if at_end or (check_labels and (labels_[_as[i]] != labels_[_as[i+1]])):
Member Author:

The check_labels condition is not necessary for correctness, since labels_ is zeroed if no labels are passed, but short-circuiting to avoid the unnecessary comparison labels_[_as[i]] != labels_[_as[i+1]] seemed to give a small perf boost.
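A rough Python illustration of the effect (a hypothetical stand-in for the Cython loop): when labels is all zeros, np.any(labels) is False and the `and` short-circuits past the two per-iteration array lookups.

```python
import numpy as np

def count_group_ends(sort_idx, labels):
    """Hypothetical stand-in for the dup-resetting loop: check_labels is
    False for the ungrouped case, so the label comparison is skipped."""
    check_labels = np.any(labels)
    n, ends = len(sort_idx), 0
    for i in range(n):
        at_end = i == n - 1
        # With check_labels False, the two lookups below never run
        if at_end or (check_labels and labels[sort_idx[i]] != labels[sort_idx[i + 1]]):
            ends += 1
    return ends

labels = np.zeros(100_000, dtype=np.intp)  # rank_1d case: no real groups
print(count_group_ends(np.arange(100_000), labels))  # 1: only at_end fires
```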

@@ -444,6 +444,7 @@ def test_rank_avg_even_vals():
tm.assert_frame_equal(result, exp_df)


@pytest.mark.xfail(reason="Works now, needs tests")
Member Author:

To keep this PR focused on the approach to deduplication, I plan to leave this for a followup (basically adding a whatsnew note and tests, which will close #38278).

@arw2019 (Member) left a comment:

This looks cool!

@jreback (Contributor) left a comment:

To make the diff easier on the eyes, can you do a precursor PR which just straight up moves the things you are going to need?

@jreback (Contributor) commented Dec 28, 2020

> Is this kind of refactor/deduplication helpful? If so, similar logic can probably be applied elsewhere.

yes, absolutely. I am not sure if we have an issue about this, but we would certainly take things like this. Note these might be quite tricky because there are possibilities of subtle differences between the algos, and performance is always a concern (e.g. the 1-d algos with no grouping are often faster than an equivalent groupby with a single group), but the code unification is more important.

@mzeitlin11 (Member, Author) commented Dec 28, 2020

> To make the diff easier on the eyes, can you do a precursor PR which just straight up moves the things you are going to need?

Was worried this would be an issue, but I'm not sure how to clean it up. The diff is so bad because rank_1d is being replaced with group_rank, so the diff essentially compares the original rank_1d to the original group_rank (this single replacement alone makes the diff useless). The helpful diff here would be the new rank_1d vs the old group_rank. I can't think of anything that could be moved as a precursor without breaking existing functionality (other than orthogonal additions like the benchmarks); any advice would be much appreciated if you had something specific in mind.

@jreback (Contributor) commented Dec 28, 2020

> To make the diff easier on the eyes, can you do a precursor PR which just straight up moves the things you are going to need?
>
> Was worried this would be an issue, but I'm not sure how to clean it up. The diff is so bad because rank_1d is being replaced with group_rank, so the diff essentially compares the original rank_1d to the original group_rank (this single replacement alone makes the diff useless). The helpful diff here would be the new rank_1d vs the old group_rank. I can't think of anything that could be moved as a precursor without breaking existing functionality (other than orthogonal additions like the benchmarks); any advice would be much appreciated if you had something specific in mind.

ok, it's fine if a precursor is not helpful (e.g. just do this one)

@jreback (Contributor) commented Dec 29, 2020

can you show the asv's vs master

@mzeitlin11 (Member, Author):

> can you show the asv's vs master

They're in a details block in the PR body.

@jreback (Contributor) commented Dec 30, 2020

can you merge master

@mzeitlin11 (Member, Author):

Merged master

    values = np.asarray(in_arr).copy()
elif rank_t is object:
    values = np.array(in_arr, copy=True)
N = len(in_arr)
Contributor:

can you assert len(labels) == N

Member Author:

Done
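The requested guard amounts to checking that each value has exactly one label; a trivial sketch of the shape check (hypothetical wrapper, not the actual Cython):

```python
import numpy as np

def validate_rank_inputs(values, labels):
    # Guard from the review: labels must align one-to-one with values
    N = len(values)
    assert len(labels) == N, "labels must be the same length as values"

validate_rank_inputs(np.array([1.0, 2.0]), np.array([0, 0]))  # passes
```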

# Copy values into new array in order to fill missing data
# with mask, without obfuscating location of missing data
# in values array
masked_vals = np.array(in_arr, copy=True)
Contributor:

in_arr.copy() ?

Member Author:

Done
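For context on the suggestion: both spellings make an independent copy; `.copy()` is simply the more direct idiom.

```python
import numpy as np

in_arr = np.array([1.0, np.nan, 3.0])

a = np.array(in_arr, copy=True)  # constructor-style copy
b = in_arr.copy()                # equivalent and more idiomatic

a[0], b[0] = -1.0, -2.0          # mutate the copies
print(in_arr[0])                 # 1.0 -- the original is untouched
```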

# in values array
masked_vals = np.array(in_arr, copy=True)
if rank_t is object and masked_vals.dtype != np.object_:
    masked_vals = masked_vals.astype('O')
Contributor:

you can just do this for object types (e.g. do an else and then)
masked_vals = in_arr.copy(), since .astype always copies

Member Author:

Thanks
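Roughly, the simplification being suggested, as a Python sketch (assuming rank_t reduces to an object/non-object branch; ndarray.astype returns a new array by default, so no separate copy is needed on the object path):

```python
import numpy as np

def make_masked_vals(in_arr, is_object):
    """Sketch of the restructured branch: exactly one copy either way."""
    if is_object:
        return in_arr.astype('O')  # converts and copies in a single step
    return in_arr.copy()

vals = np.array([2.0, np.nan, 1.0])
masked = make_masked_vals(vals, is_object=False)
masked[0] = 0.0
print(vals[0])  # 2.0 -- still an independent copy
```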

@jreback (Contributor) commented Dec 31, 2020

cc @jbrockmendel

for i in range(N):
    # We don't include NaN values in percentage
    # rankings, so we assign them percentages of NaN.
    if out[i] != out[i] or out[i] == NaN:
Member:

does out[i] == NaN ever hold?

Member Author:

Hmm, it is a strange condition; not sure why it was there in the original code. Just removing that entire if statement doesn't seem to change the logic (I think it was just saying "if NaN, then set as NaN").
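For reference, the reason the second half of the condition is dead code: IEEE 754 NaN compares unequal to everything, including itself, so x != x is the standard NaN test and x == NaN can never be True.

```python
import numpy as np

x = np.nan
print(x != x)       # True  -- the usual self-inequality NaN check
print(x == np.nan)  # False -- NaN never compares equal, even to NaN
```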

@jreback (Contributor) left a comment:

lgtm, a few nits

# each label corresponds to a different group value,
# the mask helps you differentiate missing values before
# performing sort on the actual values
_as = np.lexsort(order).astype(np.int64, copy=False)
Contributor:

can you rename _as to something more readable, maybe lexsort_indexer

Member Author:

Done
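For illustration, the kind of ordering np.lexsort produces from these keys (toy arrays; the real keys are the masked values, the NaN mask, and the group labels, with the last key sorted first):

```python
import numpy as np

values = np.array([3.0, np.nan, 2.0, 1.0])
mask = np.isnan(values)            # separates missing values
labels = np.array([1, 0, 0, 1])    # group membership

# np.lexsort treats the LAST key as primary: labels, then the mask,
# then the values -- each group comes out contiguous, with its missing
# values segregated before the value sort applies.
lexsort_indexer = np.lexsort((values, mask, labels))
print(lexsort_indexer)  # [2 1 3 0]: group 0 -> (2.0, NaN), group 1 -> (1.0, 3.0)
```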

@jreback added this to the 1.3 milestone Dec 31, 2020
@jreback (Contributor) left a comment:

small comment, can catch in a followup, thanks @mzeitlin11

grp_sizes = np.ones(N)
# If all 0 labels, can short-circuit later label
# comparisons
check_labels = np.any(labels)
Contributor:

in your follow-on, can you put a blank line before comments? easier to read


if rank_t is object:
    _as = np.lexsort(keys=order)
if ascending ^ (na_option == 'top'):
Contributor:

ideally we add a comment here on what this is doing
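One hedged reading of the condition (my interpretation, not a comment from the PR): the XOR is True exactly when missing values should behave as the largest values in the sort key, so they land at the end of each sorted group.

```python
# Truth table for the condition, under that interpretation
for ascending in (True, False):
    for na_option in ("top", "bottom"):
        nans_sort_last = ascending ^ (na_option == "top")
        print(f"ascending={ascending}, na_option={na_option!r} "
              f"-> NaNs act as largest: {nans_sort_last}")
# ascending=True,  'bottom' -> True   (NaNs take the highest ranks)
# ascending=True,  'top'    -> False  (NaNs take the lowest ranks)
# ascending=False, 'top'    -> True;  ascending=False, 'bottom' -> False
```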

@jreback merged commit 2b4bcf2 into pandas-dev:master on Jan 1, 2021
@mzeitlin11 deleted the ref/rank branch on Jan 1, 2021 at 00:21
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request on Jan 20, 2021
@mzeitlin11 mentioned this pull request on Mar 21, 2021