-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF/POC: Share groupby/series algos (rank) #38744
Conversation
pandas/_libs/algos.pyx
Outdated
# decrement that from their position. fill in the size of each | ||
# group encountered (used by pct calculations later). also be | ||
# sure to reset any of the items helping to calculate dups | ||
if at_end or (check_labels and (labels_[_as[i]] != labels_[_as[i+1]])): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check_labels
is not necessary for correctness since labels_
is zeroed if no labels are passed, but short-circuiting to avoid the unnecessary check labels_[_as[i]] != labels_[_as[i+1]]
seemed to give a small perf boost
@@ -444,6 +444,7 @@ def test_rank_avg_even_vals(): | |||
tm.assert_frame_equal(result, exp_df) | |||
|
|||
|
|||
@pytest.mark.xfail(reason="Works now, needs tests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep this PR focused on the approach to deduplication, plan to leave this for a followup (basically add whatsnew, tests, which will close #38278)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks cool!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make the diff easier on the eyes. can you do a pre-cursor PR which just straight up moves things that you are going to need.
yes absolutely. I am not sure if we have an issue about this, but certianly would take things like this. note these might be quite tricky because there are possibilities of sublte differences between the algos & performance is always a concern (e.g. the 1-d with no grouping are often faster than an equivalent groupby with a single group. but the code unification is more important). |
Was worried this would be an issue, but not sure how to clean up. The diff is so bad because |
ok its fine if a precusor is not helpful (e.g. just do this one) |
can you show the asv's vs master |
They're in a |
can you merge master |
Merged master |
pandas/_libs/algos.pyx
Outdated
values = np.asarray(in_arr).copy() | ||
elif rank_t is object: | ||
values = np.array(in_arr, copy=True) | ||
N = len(in_arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you assert len(labels) == N
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/_libs/algos.pyx
Outdated
# Copy values into new array in order to fill missing data | ||
# with mask, without obfuscating location of missing data | ||
# in values array | ||
masked_vals = np.array(in_arr, copy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in_arr.copy()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/_libs/algos.pyx
Outdated
# in values array | ||
masked_vals = np.array(in_arr, copy=True) | ||
if rank_t is object and masked_vals.dtype != np.object_: | ||
masked_vals = masked_vals.astype('O') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just do this for object types (e.g. do an else and then)
mask_vals = in_array.copy()
since .astype always copies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
pandas/_libs/algos.pyx
Outdated
for i in range(N): | ||
# We don't include NaN values in percentage | ||
# rankings, so we assign them percentages of NaN. | ||
if out[i] != out[i] or out[i] == NaN: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does out[i] == NaN
ever hold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it is a strange condition, not sure why it was there in the original code. Just removing that entire if statement doesn't seem to change the logic (I think it was just saying "if nan, then set as nan").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm a few nits
pandas/_libs/algos.pyx
Outdated
# each label corresponds to a different group value, | ||
# the mask helps you differentiate missing values before | ||
# performing sort on the actual values | ||
_as = np.lexsort(order).astype(np.int64, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename _as to something more readable, maybe lexsort_indexer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment can catch in a followup, thanks @mzeitlin11
grp_sizes = np.ones(N) | ||
# If all 0 labels, can short-circuit later label | ||
# comparisons | ||
check_labels = np.any(labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in your followon can you put a blank line before comments, easier to read
|
||
if rank_t is object: | ||
_as = np.lexsort(keys=order) | ||
if ascending ^ (na_option == 'top'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally we add a comment here on what this is doing
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Right now there are 3 rank algorithms which are extremely similar, but have slight differences. This seems to cause some maintenance pain. First, inconsistencies between implementations allows more bug potential (like #32593, only a bug for
rank_2d
). For issues like #32859 which affect all 3, a fix would have to be applied in 3 places and testing added separately forSeries
,DataFrame
, andGroupBy
. This pr attempts to mitigate these issues by combining the implementationsgroup_rank
andrank_1d
(which as an additional bonus gives the enhancement of object support forGroupBy.rank()
(#38278)).Is this kind of refactor/deduplication helpful? If so, similar logic can probably be applied elsewhere.
The diff here makes the changes look more complicated than they are because
rank_1d
is essentially replaced bygroup_rank
. The originalgroup_rank
implementation is only slightly changed to allow for optional labels and object support.Benchmarks look ok: