Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: making mode stable/keeping original ordering #39353

Merged
merged 6 commits into from
Jan 27, 2021

Conversation

realead
Copy link
Contributor

@realead realead commented Jan 23, 2021

This is follow up of #39009 but for mode: now keys returned by mode are in the original order. This is also improvement of the code quality as mode now uses the same code path as value_counts.

Fixes a performance regression of #39009.

@jreback jreback added the Bug label Jan 24, 2021
@jreback jreback added this to the 1.3 milestone Jan 24, 2021
@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jan 24, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm small coment

# thus we have 2 nans and 1 True
modes = ht.mode_object(values, False)
assert modes.size == 1
assert checknull(modes[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use is_na instead of the internal checknull

@jreback
Copy link
Contributor

jreback commented Jan 25, 2021

can you run the mode asvs (doubt we have too many), value_counts & some indexing ones just to make sure nothing is changing here

@realead
Copy link
Contributor Author

realead commented Jan 25, 2021

I have added some asv-benchmarks, because otherwise mode seems to be not covered in asv_bench.

For running asv continuous -f 1.10 upstream/master HEAD -b series_methods.Mode -b ^indexing I got:

       before           after         ratio
     [309cf3a6]       [e91e27a2]
+        18.6±1ms         51.3±4ms     2.76  series_methods.Mode.time_mode(100000, 'int')
+      18.9±0.5ms         38.4±5ms     2.03  series_methods.Mode.time_mode(100000, 'uint')
+      43.8±0.4ms         88.9±7ms     2.03  series_methods.Mode.time_mode(100000, 'float')
+     1.19±0.03ms      1.55±0.02ms     1.30  series_methods.Mode.time_mode(10000, 'int')
+     1.20±0.01ms      1.52±0.03ms     1.26  series_methods.Mode.time_mode(10000, 'uint')
+     1.89±0.08ms      2.39±0.04ms     1.26  series_methods.Mode.time_mode(10000, 'float')
+         134±2ms          153±2ms     1.15  series_methods.Mode.time_mode(100000, 'object')

There is a slow down, as soon as the table with keys no longer fits the cache (10^3 is OK, 10^5 isn't). More cache-misses should be expected, due to an additional indirection in value_counts_xxx, which is needed for keeping the original order.

Looking at the results it looks as if the array-sizes in asv-tests for value_counts are too small, otherwise this regression should be visible in asv-run in #39009 as well.

@realead
Copy link
Contributor Author

realead commented Jan 25, 2021

When I change series_methods.ValueCounts similar to ecf9ff6, i.e. adding different array-sizes, and compare with asv PR #39009 I get the following:

       before           after         ratio
     [2f883215]       [ecf9ff6e]
+      25.6±0.8ms         74.5±1ms     2.91  series_methods.ValueCounts.time_value_counts(100000, 'int')
+        26.4±1ms         73.1±5ms     2.76  series_methods.ValueCounts.time_value_counts(100000, 'uint')
+      51.2±0.7ms         123±20ms     2.40  series_methods.ValueCounts.time_value_counts(100000, 'float')
-         212±9ms          176±3ms     0.83  series_methods.ValueCounts.time_value_counts(100000, 'object')
-     1.50±0.02ms      1.05±0.02ms     0.70  series_methods.ValueCounts.time_value_counts(1000, 'object')
-      12.5±0.1ms       8.57±0.1ms     0.69  series_methods.ValueCounts.time_value_counts(10000, 'object')

With smaller sizes (as it is currently the case), the issues with cache misses aren't visible and one sees only speed-up in case of object (because some pre-/post-processing is no longer needed):

       before           after         ratio
     [2f883215]       [ecf9ff6e]
-     1.50±0.02ms      1.05±0.02ms     0.70  series_methods.ValueCounts.time_value_counts(1000, 'object')

@jreback
Copy link
Contributor

jreback commented Jan 26, 2021

is the build_count_table just a special case version of value_counts that doesn't keep order (and that is why its faster)?

@realead
Copy link
Contributor Author

realead commented Jan 26, 2021

@jreback

is the build_count_table just a special case version of value_counts that doesn't keep order (and that is why its faster)?

that is correct. However, my implementation is a little bit naive: I have missed, that there are more cache misses now (misses for look-up in table + miss for lookup in the line result_counts.data.data[unique_key_index] += 1 vs misses only for look-up table in the old version) and it really counts for bigger arrays.

I think there is a way to improve the performance (which would also improve value_counts as well) making it almost as fast as the old version. I will try it.

@realead
Copy link
Contributor Author

realead commented Jan 26, 2021

Comparison to the state before #39009 was merged (-b series_methods.ValueCounts -b series_methods.Mode):

       before           after         ratio
     [2f883215]       [31512ea5]
-         251±5ms          197±4ms     0.79  series_methods.ValueCounts.time_value_counts(100000, 'object')
-      13.3±0.8ms       9.36±0.8ms     0.70  series_methods.ValueCounts.time_value_counts(10000, 'object')
-      1.73±0.1ms      1.18±0.08ms     0.68  series_methods.ValueCounts.time_value_counts(1000, 'object')

no additional cache misses and better times for object, because post processing was removed.

Compared to the current master:

       before           after         ratio
     [309cf3a6]       [31512ea5]
-         165±4ms         62.7±2ms     0.38  series_methods.ValueCounts.time_value_counts(100000, 'float')
-         103±2ms         31.9±2ms     0.31  series_methods.ValueCounts.time_value_counts(100000, 'int')
-         106±2ms         33.0±2ms     0.31  series_methods.ValueCounts.time_value_counts(100000, 'uint')

Performance regression of #39009 has been reverted.

@jreback jreback merged commit 984def2 into pandas-dev:master Jan 27, 2021
@jreback
Copy link
Contributor

jreback commented Jan 27, 2021

very nice thanks @realead keep em coming!

@realead realead deleted the consistent_mode branch January 27, 2021 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent results of value_counts and mode for different nan-values
2 participants