Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix series.isin slow issue with Dtype IntegerArray #38379

Merged
merged 39 commits into from
Jan 20, 2021

Conversation

tushushu
Copy link
Contributor

@tushushu tushushu commented Dec 9, 2020

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the appropriate place

in algorithms.isin itself instead

@tushushu
Copy link
Contributor Author

tushushu commented Dec 9, 2020

this is not the appropriate place

in algorithms.isin itself instead

Hi, I've move this code to algorithms.isin, but due to circular import, we cannot import IntegerArray in that file. So I have to use comp.__class__.__name__ is 'IntegerArray' instead. If you got better solution, please give me some advices, thanks so much.

@@ -447,6 +447,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
else:
values = extract_array(values, extract_numpy=True)

if type(comps).__name__ == "IntegerArray":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work if you handle it in the extension array elif instead of separately up here?

elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
values.dtype
):
return isin(np.asarray(comps), np.asarray(values))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the issue here is that we actually need to implement .isin on the EAs (numeric dtypes); we already do this on the internal EAs (datetime, interval etc).

cc @jbrockmendel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work if you handle it in the extension array elif instead of separately up here?

elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
values.dtype
):
return isin(np.asarray(comps), np.asarray(values))

Hi Andrew,
Are you suggesting something like this:

 elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype( 
     values.dtype 
 ): 
     if type(comps).__name__ == "IntegerArray":
         comps = comps._data
     return isin(np.asarray(comps), np.asarray(values)) 

I've made a test the performance is good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that was what i meant

Does anything break, and do you get an isin speedup, if you relax the IntegerArray check and use comps._data for other EAs (such as FloatingArray)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the performance increases from 20ms to 2ms after the modification. Please refer to #38340 if you are interested in the performance tests.

Sure I could add other EAs here, will do that later. Thanks Andrew~

Copy link
Contributor Author

@tushushu tushushu Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found ExtensionArray does not has the '_data' member, and only subclass of BaseMaskedArray has the _data member, so I guess the classes below should be considered:
FloatingArray
IntegerArray
BooleanArray

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Dec 10, 2020
@pep8speaks
Copy link

pep8speaks commented Dec 11, 2020

Hello @tushushu! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-19 15:29:07 UTC

@@ -467,6 +467,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
values.dtype
):
if type(comps).__name__ == "IntegerArray":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to check isinstance(comps, numpy.ma.MaskedArray) I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not the numpy function but check for the masked EA class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we could use

if isinstance(comps, BaseMaskedArray):

here.

@@ -467,6 +467,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
values.dtype
):
if type(comps).__name__ == "IntegerArray":
comps = comps._data # type: ignore[attr-defined, assignment]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to lead to incorrect answers, e.g.

import pandas as pd
from pandas.core.algorithms import isin

arr = pd.array([2, 3, pd.NA, 5])

>>> isin(arr, [1])
array([False, False, False, False])
>>> isin(arr._data, [1])
array([False, False,  True, False])

Copy link
Contributor Author

@tushushu tushushu Dec 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jbrockmendel, thanks for pointing out this. If there is pd.NA in the array, the result could be incorrect, but we can multiply the result with ~comps._mask, which is a boolean numpy array for pd.NA values:

def isin_for_masked_array(comps, values):
    if isinstance(comps, BaseMaskedArray):
        result = isin(comps._data, values) * np.invert(comps._mask)
        return result
    return isin(comps, values)

I've tested the result is correct and the runtime is not bad.

Copy link
Contributor Author

@tushushu tushushu Dec 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if the case is isin([2, 3, pd.NA, 5], [1, pd.NA]), then above solution is not correct. We have to check if there is null value in the second array, then add the comps._mask to make null value's place become True in the first array. The solution could be:

def isin_for_masked_array2(comps, values):
    if isinstance(comps, BaseMaskedArray):
        result = isin(comps._data, values) * np.invert(comps._mask)
        if any(x is pd.NA for x in values):
            result += comps._mask
        return result
    return isin(comps, values)

We have to be careful when 2nd array contains 1, because MaskArray's NA value will be 1 in self._data.
The solution below could be wrong:

def wrong_solution(comps, values):
    if isinstance(comps, BaseMaskedArray):
        result = isin(comps._data, values)
        if any(x is pd.NA for x in values):
            pass
        else:
             result *= np.invert(comps._mask)
        return result
    return isin(comps, values)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to check if there is null value in the second array

Have to be a little bit careful about which null values you allow, e.g. if values contains pd.NaT and comps is an IntegerArray, we probably dont want to count it.

As for whether to set NA locations to False or NA, best to check with @jorisvandenbossche for his thoughts on the desired behavior.

On the code organization, #38422 is a sketch of where I suggest you put the implementation.

@tushushu
Copy link
Contributor Author

Here is the jupyter notebook how I analyze and fix this issue.
https://github.com/tushushu/tuopen/blob/main/pandas-pr-38379.ipynb

@jreback
Copy link
Contributor

jreback commented Dec 13, 2020

@tushushu suggest you follow what is going on in #38422 for this.

@tushushu
Copy link
Contributor Author

Thanks @jbrockmendel and @jreback , I am going to have a look at #38422

@jbrockmendel jbrockmendel mentioned this pull request Dec 20, 2020
5 tasks
@tushushu
Copy link
Contributor Author

tushushu commented Dec 27, 2020

@jbrockmendel @jorisvandenbossche Hi, I guess isin(1, 2, pd.NA], [1, pd.NA]) could return [True, False, True] but not [True, False, pd.NA].
Below are something might be useful, and I am also curious why pd.NA in pd.Series([1, 2, pd.NA] should return False?

None in [1, 2, 3, None]  # Will return True but not None
np.nan in np.array([1, 2, np.nan])  # Will return False but not np.nan
pd.NA in pd.Series([1, 2, pd.NA])  # Will return False but not pd.NA

Would you please give me some advices? Thanks~

@MarcoGorelli
Copy link
Member

I am also curious why pd.NA in pd.Series([1, 2, pd.NA] should return False?

I think this is what I'd expect - if pd.NA could be anything, then you don't know if it's in pd.Series([1, 2, pd.NA])

@jbrockmendel
Copy link
Member

gentle ping @jorisvandenbossche can you weigh in here on desired behavior

@@ -18,15 +18,17 @@
)
from pandas.core.dtypes.missing import isna, notna

import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you avoid this import

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course~

@jreback jreback merged commit 5b15515 into pandas-dev:master Jan 20, 2021
@jreback
Copy link
Contributor

jreback commented Jan 20, 2021

thanks @tushushu very nice!

sometimes takes a while to get it over the line, but done!

@jbrockmendel
Copy link
Member

could use a follow-up with tests

@jorisvandenbossche
Copy link
Member

Indeed, and some other follow-ups from my comment above #38379 as well

nofarm3 pushed a commit to nofarm3/pandas that referenced this pull request Jan 21, 2021
@tushushu
Copy link
Contributor Author

Thanks @jreback @jbrockmendel @jorisvandenbossche @arw2019 , it's my first time to contribute to Pandas. Really learnt a lot from you.

@@ -141,7 +162,7 @@ def time_isin(self, dtypes, MaxNumber, series_type):

class IsInLongSeriesValuesDominate:
params = [
["int64", "int32", "float64", "float32", "object"],
["int64", "int32", "float64", "float32", "object", "Int64", "Float64"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is causing failures with numpy 1.20

Int64 and Float64 now raise TypeError

https://github.com/pandas-dev/pandas/runs/1802727214

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: implement fast isin() for nullable dtypes
9 participants