ENH: making mode stable/keeping original ordering #39353

realead · 2021-01-23T19:09:47Z

closes BUG: Inconsistent results of value_counts and mode for different nan-values #39007
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

This is follow up of #39009 but for mode: now keys returned by mode are in the original order. This is also improvement of the code quality as mode now uses the same code path as value_counts.

Fixes a performance regression of #39009.

jreback

lgtm small coment

jreback · 2021-01-24T22:06:11Z

pandas/tests/libs/test_hashtable.py

+    # thus we have 2 nans and 1 True
+    modes = ht.mode_object(values, False)
+    assert modes.size == 1
+    assert checknull(modes[0])


can you use is_na instead of the internal checknull

jreback · 2021-01-25T19:57:57Z

can you run the mode asvs (doubt we have too many), value_counts & some indexing ones just to make sure nothing is changing here

realead · 2021-01-25T21:37:04Z

I have added some asv-benchmarks, because otherwise mode seems to be not covered in asv_bench.

For running asv continuous -f 1.10 upstream/master HEAD -b series_methods.Mode -b ^indexing I got:

       before           after         ratio
     [309cf3a6]       [e91e27a2]
+        18.6±1ms         51.3±4ms     2.76  series_methods.Mode.time_mode(100000, 'int')
+      18.9±0.5ms         38.4±5ms     2.03  series_methods.Mode.time_mode(100000, 'uint')
+      43.8±0.4ms         88.9±7ms     2.03  series_methods.Mode.time_mode(100000, 'float')
+     1.19±0.03ms      1.55±0.02ms     1.30  series_methods.Mode.time_mode(10000, 'int')
+     1.20±0.01ms      1.52±0.03ms     1.26  series_methods.Mode.time_mode(10000, 'uint')
+     1.89±0.08ms      2.39±0.04ms     1.26  series_methods.Mode.time_mode(10000, 'float')
+         134±2ms          153±2ms     1.15  series_methods.Mode.time_mode(100000, 'object')

There is a slow down, as soon as the table with keys no longer fits the cache (10^3 is OK, 10^5 isn't). More cache-misses should be expected, due to an additional indirection in value_counts_xxx, which is needed for keeping the original order.

Looking at the results it looks as if the array-sizes in asv-tests for value_counts are too small, otherwise this regression should be visible in asv-run in #39009 as well.

realead · 2021-01-25T22:03:17Z

When I change series_methods.ValueCounts similar to ecf9ff6, i.e. adding different array-sizes, and compare with asv PR #39009 I get the following:

       before           after         ratio
     [2f883215]       [ecf9ff6e]
+      25.6±0.8ms         74.5±1ms     2.91  series_methods.ValueCounts.time_value_counts(100000, 'int')
+        26.4±1ms         73.1±5ms     2.76  series_methods.ValueCounts.time_value_counts(100000, 'uint')
+      51.2±0.7ms         123±20ms     2.40  series_methods.ValueCounts.time_value_counts(100000, 'float')
-         212±9ms          176±3ms     0.83  series_methods.ValueCounts.time_value_counts(100000, 'object')
-     1.50±0.02ms      1.05±0.02ms     0.70  series_methods.ValueCounts.time_value_counts(1000, 'object')
-      12.5±0.1ms       8.57±0.1ms     0.69  series_methods.ValueCounts.time_value_counts(10000, 'object')

With smaller sizes (as it is currently the case), the issues with cache misses aren't visible and one sees only speed-up in case of object (because some pre-/post-processing is no longer needed):

       before           after         ratio
     [2f883215]       [ecf9ff6e]
-     1.50±0.02ms      1.05±0.02ms     0.70  series_methods.ValueCounts.time_value_counts(1000, 'object')

jreback · 2021-01-26T00:07:39Z

is the build_count_table just a special case version of value_counts that doesn't keep order (and that is why its faster)?

realead · 2021-01-26T10:01:26Z

@jreback

is the build_count_table just a special case version of value_counts that doesn't keep order (and that is why its faster)?

that is correct. However, my implementation is a little bit naive: I have missed, that there are more cache misses now (misses for look-up in table + miss for lookup in the line result_counts.data.data[unique_key_index] += 1 vs misses only for look-up table in the old version) and it really counts for bigger arrays.

I think there is a way to improve the performance (which would also improve value_counts as well) making it almost as fast as the old version. I will try it.

… for arrays which do not fit the cache

realead · 2021-01-26T20:21:29Z

Comparison to the state before #39009 was merged (-b series_methods.ValueCounts -b series_methods.Mode):

       before           after         ratio
     [2f883215]       [31512ea5]
-         251±5ms          197±4ms     0.79  series_methods.ValueCounts.time_value_counts(100000, 'object')
-      13.3±0.8ms       9.36±0.8ms     0.70  series_methods.ValueCounts.time_value_counts(10000, 'object')
-      1.73±0.1ms      1.18±0.08ms     0.68  series_methods.ValueCounts.time_value_counts(1000, 'object')

no additional cache misses and better times for object, because post processing was removed.

Compared to the current master:

       before           after         ratio
     [309cf3a6]       [31512ea5]
-         165±4ms         62.7±2ms     0.38  series_methods.ValueCounts.time_value_counts(100000, 'float')
-         103±2ms         31.9±2ms     0.31  series_methods.ValueCounts.time_value_counts(100000, 'int')
-         106±2ms         33.0±2ms     0.31  series_methods.ValueCounts.time_value_counts(100000, 'uint')

Performance regression of #39009 has been reverted.

jreback · 2021-01-27T14:04:00Z

very nice thanks @realead keep em coming!

realead added 2 commits January 23, 2021 19:42

use value_counts for mode

40b451d

what is new

55f3d19

jreback added the Bug label Jan 24, 2021

jreback added this to the 1.3 milestone Jan 24, 2021

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jan 24, 2021

jreback requested changes Jan 24, 2021

View reviewed changes

do not use isnull

e91e27a

adding asv_bench for mode

ecf9ff6

realead added 2 commits January 26, 2021 20:33

extending asv test case, so also cache misses play a role

a7b8e6e

avoiding one indirection and thus cache misses, improving performance…

31512ea

… for arrays which do not fit the cache

jreback approved these changes Jan 27, 2021

View reviewed changes

jreback merged commit 984def2 into pandas-dev:master Jan 27, 2021

realead deleted the consistent_mode branch January 27, 2021 19:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: making mode stable/keeping original ordering #39353

ENH: making mode stable/keeping original ordering #39353

realead commented Jan 23, 2021 •

edited

Loading

jreback left a comment

jreback Jan 24, 2021

jreback commented Jan 25, 2021

realead commented Jan 25, 2021

realead commented Jan 25, 2021

jreback commented Jan 26, 2021

realead commented Jan 26, 2021

realead commented Jan 26, 2021 •

edited

Loading

jreback commented Jan 27, 2021

ENH: making mode stable/keeping original ordering #39353

ENH: making mode stable/keeping original ordering #39353

Conversation

realead commented Jan 23, 2021 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

jreback Jan 24, 2021

Choose a reason for hiding this comment

jreback commented Jan 25, 2021

realead commented Jan 25, 2021

realead commented Jan 25, 2021

jreback commented Jan 26, 2021

realead commented Jan 26, 2021

realead commented Jan 26, 2021 • edited Loading

jreback commented Jan 27, 2021

realead commented Jan 23, 2021 •

edited

Loading

realead commented Jan 26, 2021 •

edited

Loading