PERF: building MultiIndex with categorical levels #26721

Merged
merged 1 commit into from
Jun 8, 2019

Contributor

@0x0L 0x0L commented Jun 7, 2019

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.arange(1_000_000, dtype=np.int32),
    'b': np.arange(1_000_000, dtype=np.int64),
    'c': np.arange(1_000_000, dtype=float),
}).astype({'a': 'category', 'b': 'category'})

%timeit df.set_index(['a', 'b'])

On my machine, this takes ~20 ms with this change. The current implementation takes 140 ms.

Any suggestions for tests?

@0x0L 0x0L force-pushed the categorical_multi_index branch from ae8e348 to 9345aaa Compare June 7, 2019 22:22
@WillAyd
Member

WillAyd commented Jun 8, 2019

Any suggestions for tests?

For performance-related changes you can add a benchmark under asv_bench/benchmarks, probably in the multiindex_object.py module (though categorical.py or ctors.py may also work, depending on how you build the test).
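As a rough illustration of what such a benchmark could look like, here is a minimal sketch that wraps the example from the PR description in an asv-style class. The class name, the method name, and the `n` attribute are hypothetical; only the `setup` / `time_*` convention comes from asv itself.

```python
import numpy as np
import pandas as pd


class CategoricalLevel:
    # Hypothetical benchmark class; asv calls `setup` before timing
    # every method whose name starts with `time_`.
    n = 1_000_000  # number of rows, same size as the example in the PR

    def setup(self):
        self.df = pd.DataFrame(
            {
                "a": np.arange(self.n, dtype=np.int32),
                "b": np.arange(self.n, dtype=np.int64),
                "c": np.arange(self.n, dtype=float),
            }
        ).astype({"a": "category", "b": "category"})

    def time_categorical_level(self):
        # Builds a two-level MultiIndex whose levels are both categorical.
        self.df.set_index(["a", "b"])
```

Keeping the frame construction in `setup` means only the `set_index` call is timed.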

@WillAyd WillAyd added the Categorical, MultiIndex, and Performance labels Jun 8, 2019
@0x0L
Contributor Author

0x0L commented Jun 8, 2019

@WillAyd
I can't think of any possible performance downside, since this just avoids rebuilding the codes (which are 0..N-1 by definition) through a lookup.

I was thinking more in terms of possible regressions. I'm not sure what I should really test or worry about, but I couldn't find any test using a MultiIndex with at least one CategoricalIndex level...
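The "0..N-1 by definition" point can be checked with a small sketch. It uses only public pandas API and is an illustration of the idea, not the code changed in this PR; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

values = pd.Categorical(["b", "a", "c", "a"])

# Level built through a lookup: the categories are re-encoded against themselves.
level = pd.CategoricalIndex(values.categories, dtype=values.dtype)

# Every category appears exactly once and in order, so the codes are 0..N-1.
expected_codes = np.arange(len(values.categories))
assert np.array_equal(level.codes, expected_codes)

# An equal level can be assembled directly from those codes, with no lookup.
fast_level = pd.CategoricalIndex(
    pd.Categorical.from_codes(expected_codes, dtype=values.dtype)
)
assert fast_level.equals(level)
```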

if isinstance(values, (ABCCategoricalIndex, ABCSeries)):
    values = values._values
categories = CategoricalIndex(values.categories, dtype=values.dtype)
values = CategoricalIndex(values)
Contributor

Can you add a comment here explaining what is happening?

@@ -514,6 +514,7 @@ Performance Improvements
- Improved performance of :meth:`read_csv` by faster concatenating date columns without extra conversion to string for integer/float zero and float ``NaN``; by faster checking the string for the possibility of being a date (:issue:`25754`)
- Improved performance of :attr:`IntervalIndex.is_unique` by removing conversion to ``MultiIndex`` (:issue:`24813`)
- Restored performance of :meth:`DatetimeIndex.__iter__` by re-enabling specialized code path (:issue:`26702`)
- Improved performance of :meth:`DataFrame.set_index` when using multiple indexes and at least one of them is a categorical object (:issue:`22044`)
Contributor

Can you add the asv benchmark as you have it in the example? Also, we might not have tests for using multiple categoricals in a MultiIndex; see if you can add one in pandas/tests/indexes/multi/test_constructor that matches what .set_index() is doing at a lower level.
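A test along those lines might look like the sketch below. The test name and data are hypothetical, and it assumes `MultiIndex.from_arrays` is the lower-level constructor of interest; it is not the test that was added in this PR.

```python
import numpy as np
import pandas as pd


def test_from_arrays_with_multiple_categoricals():
    # Hypothetical test: two categorical arrays, one with an unused category.
    a = pd.Categorical(["x", "y", "x"], categories=["x", "y", "z"])
    b = pd.Categorical([3, 1, 2])
    mi = pd.MultiIndex.from_arrays([a, b], names=["a", "b"])

    # Each level keeps the full category set, including the unused "z".
    assert list(mi.levels[0]) == ["x", "y", "z"]
    assert list(mi.levels[1]) == [1, 2, 3]

    # The MultiIndex codes are the categorical codes, unchanged.
    assert np.array_equal(mi.codes[0], a.codes)
    assert np.array_equal(mi.codes[1], b.codes)

    # set_index on the same data should produce an equal index.
    df = pd.DataFrame({"a": a, "b": b, "c": [1.0, 2.0, 3.0]})
    assert df.set_index(["a", "b"]).index.equals(mi)
```

Including an unused category guards against a regression where the level is rebuilt from the observed values instead of from the declared categories.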

@0x0L 0x0L force-pushed the categorical_multi_index branch from 9345aaa to f3f10bc Compare June 8, 2019 18:07
@pep8speaks

pep8speaks commented Jun 8, 2019

Hello @0x0L! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-08 19:56:51 UTC

@0x0L 0x0L force-pushed the categorical_multi_index branch from f3f10bc to 59a3c67 Compare June 8, 2019 19:56
@jreback jreback added this to the 0.25.0 milestone Jun 8, 2019
@codecov

codecov bot commented Jun 8, 2019

Codecov Report

Merging #26721 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26721      +/-   ##
==========================================
- Coverage   91.78%   91.77%   -0.01%     
==========================================
  Files         174      174              
  Lines       50703    50703              
==========================================
- Hits        46538    46534       -4     
- Misses       4165     4169       +4
Flag         Coverage Δ
#multiple    90.37% <100%> (ø) ⬆️
#single      41.81% <100%> (-0.09%) ⬇️

Impacted Files                       Coverage Δ
pandas/core/arrays/categorical.py    95.92% <100%> (ø) ⬆️
pandas/io/gbq.py                     78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py                 96.88% <0%> (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3937fbc...59a3c67.

@jreback jreback merged commit 23fbf28 into pandas-dev:master Jun 8, 2019
@jreback
Contributor

jreback commented Jun 8, 2019

thanks @0x0L very nice!

Labels
Categorical, MultiIndex, Performance
Development

Successfully merging this pull request may close these issues.

Multi-index and CategoricalIndex performance
4 participants