-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: speed up CategoricalIndex.get_loc #23235
PERF: speed up CategoricalIndex.get_loc #23235
Conversation
Hello @topper-123! Thanks for updating the PR.
Comment last updated on October 20, 2018 at 13:59 Hours UTC |
77f35df
to
adfcb0c
Compare
Out of curiosity, does this fix improve |
No, MultiIndex has a different implementation, and if I remember correctly, it always uses either int64 dtype or a pure python implementation. So the issue with recoding from int8 to int64 etc. should not be an issue there. (I have not studied the MultiEngine implementation in any great detail, so closer inspection might prove me wrong.) |
505fc8c
to
f7b3487
Compare
Codecov Report
@@ Coverage Diff @@
## master #23235 +/- ##
==========================================
+ Coverage 92.22% 92.22% +<.01%
==========================================
Files 169 169
Lines 50900 50902 +2
==========================================
+ Hits 46943 46945 +2
Misses 3957 3957
Continue to review full report at Codecov.
|
All tests pass. @jreback, I've simplified this new version a lot now by adding 'hashtable_name' and 'hashtable_dtype' to the |
lgtm. over to @jbrockmendel |
@@ -10,7 +10,8 @@ from libc.math cimport fabs, sqrt | |||
import numpy as np | |||
cimport numpy as cnp | |||
from numpy cimport (ndarray, | |||
NPY_INT64, NPY_UINT64, NPY_INT32, NPY_INT16, NPY_INT8, | |||
NPY_INT64, NPY_INT32, NPY_INT16, NPY_INT8, | |||
NPY_UINT64, NPY_UINT32, NPY_UINT16, NPY_UINT8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all of these used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NPY_UINT32, NPY_UINT16 and uint23_t and uint16_t are not used. This needs a discussion, see below.
@@ -1,19 +1,31 @@ | |||
import numpy as np | |||
|
|||
from pandas._libs.index import (Int64Engine, UInt64Engine, Float64Engine, | |||
ObjectEngine) | |||
from pandas._libs import index as li |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a big deal, but I think elsewhere we import this as libindex
The explanation in the OP is exceptionally clear, thanks @topper-123. If the possibly-unused imports are resolved, LGTM. |
Actually, do we have tests that cover each of the dtypes that |
Currently, the algos in For exampel currently this fails: >>> arr = np.array([1,2,3,4], dtype=np.uint32)
>>> engine = pd._libs.index.UInt32Engine(lambda: arr, 4)
>>> engine.is_monotonic_increasing
AttributeError: module 'pandas._libs.algos' has no attribute 'ensure_uint32' As I understood @jreback, UInt(32|16)Engine is desired, so I'll add the various uint32/uint16 variants to algos.pyx to make the uint indexing engines work, is that correct? I'll also add a test_indexing_engines.py to capture such errors at the one above. |
f3ad606
to
1b55d52
Compare
Added various stuff to get UInt(32|16)Engine to work, added tests for same and other minor stuff. |
dea7538
to
f91f3c7
Compare
@jbrockmendel, I've made tests for Also, I've made some tests for engine themselves in test_indexing_engine.py. This last bit is needed because some engine types are not used yet in pandas. (These are |
f91f3c7
to
08cd0b9
Compare
can you merge master and push |
08cd0b9
to
a519cc4
Compare
from pandas import compat | ||
from pandas._libs import algos as libalgos, index as libindex | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to just combine these into a single set of tests w/o regards to whether this is object / numeric. this is pretty duplicative set of tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the discussion below, I could just remove the tests for ObjectEngine. ObjectEngine is used in lots of other places.
# monotonic increasing | ||
engine = engine_type(lambda: arr, len(arr)) | ||
assert engine.is_monotonic_increasing is True | ||
assert engine.is_monotonic_decreasing is False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are adding a ton of tests here which is good. but i suspect a lot of this indexing on the engines is already tested elsewhere, can you remove it where appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int(32|16|8)Engine, UInt(32|16|8)Engine and Float(32|16|8)Engine are all new, and of these new ones, only Int(32|16|8)Engine are used on code elsewhere (In CategoricalIndex).
This means that unexpected failures could happen in those unused engines...
I could remove tests for Int(8|16|32|64)Engine and (UInt|Float)64Engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not what I mean. I think we have explict test for these engines elsewhere, see if you can find them an dconsolidate them here. You have a nice set of unit tests, but we want to avoid some duplicaton elsehwere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find any tests explicitly for these engines. Lots of tests for e.g. Index.is_monotonic etc. but that's a differerent set of test IMO and we should always have tests for the public APIs.
I've searched for "_engine" and "engine" in the pandas/tests directory, but I'm coming up short. I've found tests for various parser engines, but I can't find any for indexing engines (except one insignificant one in multi/test_contains.py).
If you really think there are tests for indexing engines, could you point me to an example?
thanks @topper-123 |
…_pr2 * repo_org/master: DOC: Add docstring validations for "See Also" section (pandas-dev#23143) TST: Fix test assertion (pandas-dev#23357) BUG: Handle Period in combine (pandas-dev#23350) REF: SparseArray imports (pandas-dev#23329) CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992) DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051) Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341) TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304) API: Add sparse Acessor (pandas-dev#23183) PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235)
…ndas * repo_org/master: (23 commits) DOC: Add docstring validations for "See Also" section (pandas-dev#23143) TST: Fix test assertion (pandas-dev#23357) BUG: Handle Period in combine (pandas-dev#23350) REF: SparseArray imports (pandas-dev#23329) CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992) DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051) Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341) TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304) API: Add sparse Acessor (pandas-dev#23183) PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235) fix and test incorrect case in delta_to_nanoseconds (pandas-dev#23302) BUG: Handle Datetimelike data in DataFrame.combine (pandas-dev#23317) TST: re-enable gbq tests (pandas-dev#23303) Switched references of App veyor to azure pipelines in the contributing CI section (pandas-dev#23311) isort imports-io (pandas-dev#23332) DOC: Added a Multi Index example for the Series.sum method (pandas-dev#23279) REF: Make PeriodArray an ExtensionArray (pandas-dev#22862) DOC: Added Examples for Series max (pandas-dev#23298) API/ENH: tz_localize handling of nonexistent times: rename keyword + add shift option (pandas-dev#22644) BUG: Let MultiIndex.set_levels accept any iterable (pandas-dev#23273) (pandas-dev#23291) ...
…xamples * repo_org/master: (83 commits) DOC: Add docstring validations for "See Also" section (pandas-dev#23143) TST: Fix test assertion (pandas-dev#23357) BUG: Handle Period in combine (pandas-dev#23350) REF: SparseArray imports (pandas-dev#23329) CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992) DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051) Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341) TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304) API: Add sparse Acessor (pandas-dev#23183) PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235) fix and test incorrect case in delta_to_nanoseconds (pandas-dev#23302) BUG: Handle Datetimelike data in DataFrame.combine (pandas-dev#23317) TST: re-enable gbq tests (pandas-dev#23303) Switched references of App veyor to azure pipelines in the contributing CI section (pandas-dev#23311) isort imports-io (pandas-dev#23332) DOC: Added a Multi Index example for the Series.sum method (pandas-dev#23279) REF: Make PeriodArray an ExtensionArray (pandas-dev#22862) DOC: Added Examples for Series max (pandas-dev#23298) API/ENH: tz_localize handling of nonexistent times: rename keyword + add shift option (pandas-dev#22644) BUG: Let MultiIndex.set_levels accept any iterable (pandas-dev#23273) (pandas-dev#23291) ...
git diff upstream/master -u -- "*.py" | flake8 --diff
This is the the final puzzle for #20395 and speeds up
CategoricalIndex.get_loc
for monotonic increasing indexes.This PR supersedes #21699, so #21699 should be closed.
The problem with the current get_loc implementation is that
CategoricalIndex._engine
constantly recodes int8 arrays to int64 arrays:pandas/pandas/core/indexes/category.py
Lines 365 to 369 in dc45fba
Notice the
lambda: self.codes.astype('i8')
part, which means that every time_engine.vgetter
is called, an int64-array is created. This is expensive and we would ideally want to just use the original array dtype (int8, int16 or whatever) always and avoid this conversion.A complicating issue is that it is not enough to just avoid the int64 transformation, as
array.searchsorted
apparantly needs a dtype-compatible input or else it is also very slow:Solution
As CategoricalIndex.codes may be int8, int16, etc, the solution must
(1) have an indexing engine for each integer dtype and
(2) have the code for the key be translated into the same dtype as the codes array before calling searchsorted.
This PR does that, essentially.
Performance improvement examples
So we go from O(n) performance to O(1) performance.
The indexing_engines.py ASV results: