-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: making mode stable/keeping original ordering #39353
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm small coment
pandas/tests/libs/test_hashtable.py
Outdated
# thus we have 2 nans and 1 True | ||
modes = ht.mode_object(values, False) | ||
assert modes.size == 1 | ||
assert checknull(modes[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use is_na instead of the internal checknull
can you run the mode asvs (doubt we have too many), value_counts & some indexing ones just to make sure nothing is changing here |
I have added some asv-benchmarks, because otherwise mode seems to be not covered in asv_bench. For running
There is a slow down, as soon as the table with keys no longer fits the cache (10^3 is OK, 10^5 isn't). More cache-misses should be expected, due to an additional indirection in Looking at the results it looks as if the array-sizes in asv-tests for |
When I change
With smaller sizes (as it is currently the case), the issues with cache misses aren't visible and one sees only speed-up in case of
|
is the build_count_table just a special case version of value_counts that doesn't keep order (and that is why its faster)? |
that is correct. However, my implementation is a little bit naive: I have missed, that there are more cache misses now (misses for look-up in table + miss for lookup in the line I think there is a way to improve the performance (which would also improve |
… for arrays which do not fit the cache
Comparison to the state before #39009 was merged (
no additional cache misses and better times for object, because post processing was removed. Compared to the current master:
Performance regression of #39009 has been reverted. |
very nice thanks @realead keep em coming! |
This is follow up of #39009 but for
mode
: now keys returned bymode
are in the original order. This is also improvement of the code quality asmode
now uses the same code path asvalue_counts
.Fixes a performance regression of #39009.