-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix series.isin slow issue with Dtype IntegerArray #38379
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not the appropriate place
in algorithms.isin itself instead
Hi, I've move this code to |
pandas/core/algorithms.py
Outdated
@@ -447,6 +447,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray: | |||
else: | |||
values = extract_array(values, extract_numpy=True) | |||
|
|||
if type(comps).__name__ == "IntegerArray": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this work if you handle it in the extension array elif
instead of separately up here?
pandas/pandas/core/algorithms.py
Lines 467 to 471 in 3aa8447
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype( | |
values.dtype | |
): | |
return isin(np.asarray(comps), np.asarray(values)) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the issue here is that we actually need to implement .isin on the EAs (numeric dtypes); we already do this on the internal EAs (datetime, interval etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this work if you handle it in the extension array
elif
instead of separately up here?
pandas/pandas/core/algorithms.py
Lines 467 to 471 in 3aa8447
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype( values.dtype ): return isin(np.asarray(comps), np.asarray(values))
Hi Andrew,
Are you suggesting something like this:
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
values.dtype
):
if type(comps).__name__ == "IntegerArray":
comps = comps._data
return isin(np.asarray(comps), np.asarray(values))
I've made a test the performance is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that was what i meant
Does anything break, and do you get an isin
speedup, if you relax the IntegerArray
check and use comps._data
for other EAs (such as FloatingArray
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the performance increases from 20ms to 2ms after the modification. Please refer to #38340 if you are interested in the performance tests.
Sure I could add other EAs here, will do that later. Thanks Andrew~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found ExtensionArray does not has the '_data' member, and only subclass of BaseMaskedArray
has the _data member, so I guess the classes below should be considered:
FloatingArray
IntegerArray
BooleanArray
pandas/core/algorithms.py
Outdated
@@ -467,6 +467,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray: | |||
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype( | |||
values.dtype | |||
): | |||
if type(comps).__name__ == "IntegerArray": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to check isinstance(comps, numpy.ma.MaskedArray) I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not the numpy function but check for the masked EA class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we could use
if isinstance(comps, BaseMaskedArray):
here.
pandas/core/algorithms.py
Outdated
@@ -467,6 +467,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray: | |||
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype( | |||
values.dtype | |||
): | |||
if type(comps).__name__ == "IntegerArray": | |||
comps = comps._data # type: ignore[attr-defined, assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to lead to incorrect answers, e.g.
import pandas as pd
from pandas.core.algorithms import isin
arr = pd.array([2, 3, pd.NA, 5])
>>> isin(arr, [1])
array([False, False, False, False])
>>> isin(arr._data, [1])
array([False, False, True, False])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jbrockmendel, thanks for pointing out this. If there is pd.NA in the array, the result could be incorrect, but we can multiply the result with ~comps._mask, which is a boolean numpy array for pd.NA values:
def isin_for_masked_array(comps, values):
if isinstance(comps, BaseMaskedArray):
result = isin(comps._data, values) * np.invert(comps._mask)
return result
return isin(comps, values)
I've tested the result is correct and the runtime is not bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if the case is isin([2, 3, pd.NA, 5], [1, pd.NA])
, then above solution is not correct. We have to check if there is null value in the second array, then add the comps._mask to make null value's place become True in the first array. The solution could be:
def isin_for_masked_array2(comps, values):
if isinstance(comps, BaseMaskedArray):
result = isin(comps._data, values) * np.invert(comps._mask)
if any(x is pd.NA for x in values):
result += comps._mask
return result
return isin(comps, values)
We have to be careful when 2nd array contains 1, because MaskArray's NA value will be 1 in self._data.
The solution below could be wrong:
def wrong_solution(comps, values):
if isinstance(comps, BaseMaskedArray):
result = isin(comps._data, values)
if any(x is pd.NA for x in values):
pass
else:
result *= np.invert(comps._mask)
return result
return isin(comps, values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to check if there is null value in the second array
Have to be a little bit careful about which null values you allow, e.g. if values contains pd.NaT
and comps is an IntegerArray, we probably dont want to count it.
As for whether to set NA locations to False or NA, best to check with @jorisvandenbossche for his thoughts on the desired behavior.
On the code organization, #38422 is a sketch of where I suggest you put the implementation.
Here is the jupyter notebook how I analyze and fix this issue. |
Thanks @jbrockmendel and @jreback , I am going to have a look at #38422 |
@jbrockmendel @jorisvandenbossche Hi, I guess None in [1, 2, 3, None] # Will return True but not None
np.nan in np.array([1, 2, np.nan]) # Will return False but not np.nan
pd.NA in pd.Series([1, 2, pd.NA]) # Will return False but not pd.NA Would you please give me some advices? Thanks~ |
I think this is what I'd expect - if |
gentle ping @jorisvandenbossche can you weigh in here on desired behavior |
pandas/core/arrays/masked.py
Outdated
@@ -18,15 +18,17 @@ | |||
) | |||
from pandas.core.dtypes.missing import isna, notna | |||
|
|||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you avoid this import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course~
…t-isin To fix numpy_dev server errors.
thanks @tushushu very nice! sometimes takes a while to get it over the line, but done! |
could use a follow-up with tests |
Indeed, and some other follow-ups from my comment above #38379 as well |
Thanks @jreback @jbrockmendel @jorisvandenbossche @arw2019 , it's my first time to contribute to Pandas. Really learnt a lot from you. |
@@ -141,7 +162,7 @@ def time_isin(self, dtypes, MaxNumber, series_type): | |||
|
|||
class IsInLongSeriesValuesDominate: | |||
params = [ | |||
["int64", "int32", "float64", "float32", "object"], | |||
["int64", "int32", "float64", "float32", "object", "Int64", "Float64"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is causing failures with numpy 1.20
Int64 and Float64 now raise TypeError
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff