fix series.isin slow issue with Dtype IntegerArray #38379

tushushu · 2020-12-09T01:08:46Z

closes ENH: implement fast isin() for nullable dtypes #38340
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback

this is not the appropriate place

in algorithms.isin itself instead

tushushu · 2020-12-09T08:02:34Z

this is not the appropriate place

in algorithms.isin itself instead

Hi, I've moved this code to algorithms.isin, but because of a circular import we cannot import IntegerArray in that file. So I had to use a check on the class name, comps.__class__.__name__ == 'IntegerArray', instead. If you have a better solution, please give me some advice, thanks so much.
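One common way around a circular import, sketched here as an assumption and not as what this PR ended up doing, is to defer the import into the function body so that it runs at call time and not at module load time. The helper name `is_integer_array` is hypothetical; `pandas.arrays.IntegerArray` is the public alias of the class.

```python
import numpy as np
import pandas as pd


def is_integer_array(obj) -> bool:
    """Return True if obj is a nullable-integer array (hypothetical helper)."""
    # Importing inside the function delays the import until the first call,
    # by which time the modules involved have finished loading.
    from pandas.arrays import IntegerArray

    return isinstance(obj, IntegerArray)


print(is_integer_array(pd.array([1, 2, None], dtype="Int64")))  # True
print(is_integer_array(np.array([1, 2, 3])))                    # False
```

Unlike a comparison on the class name, `isinstance` also accepts subclasses and keeps working if the class is ever renamed.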

arw2019 · 2020-12-09T16:13:02Z

pandas/core/algorithms.py

@@ -447,6 +447,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
    else:
        values = extract_array(values, extract_numpy=True)

+    if type(comps).__name__ == "IntegerArray":


Will this work if you handle it in the extension array elif instead of separately up here?

pandas/pandas/core/algorithms.py

Lines 467 to 471 in 3aa8447

    elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
        values.dtype
    ):
        return isin(np.asarray(comps), np.asarray(values))

so the issue here is that we actually need to implement .isin on the EAs (numeric dtypes); we already do this on the internal EAs (datetime, interval etc).

cc @jbrockmendel


Hi Andrew,
Are you suggesting something like this:

    elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
        values.dtype
    ):
        if type(comps).__name__ == "IntegerArray":
            comps = comps._data
        return isin(np.asarray(comps), np.asarray(values))

I've tested it and the performance is good.

yeah that was what i meant

Does anything break, and do you get an isin speedup, if you relax the IntegerArray check and use comps._data for other EAs (such as FloatingArray)?

Yes, the performance increases from 20ms to 2ms after the modification. Please refer to #38340 if you are interested in the performance tests.

Sure I could add other EAs here, will do that later. Thanks Andrew~

I found that ExtensionArray does not have a '_data' member; only subclasses of BaseMaskedArray have _data, so I guess the classes below should be considered:
FloatingArray
IntegerArray
BooleanArray
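As a quick check of the claim above, the sketch below builds one array per nullable dtype and tests it against `BaseMaskedArray`. The import path `pandas.core.arrays.masked` is internal to pandas and the helper name `masked_flags` is hypothetical, so treat this as an illustration, not a supported API.

```python
import pandas as pd
from pandas.core.arrays.masked import BaseMaskedArray  # internal module

arrays = {
    "Int64": pd.array([1, None], dtype="Int64"),
    "Float64": pd.array([1.5, None], dtype="Float64"),
    "boolean": pd.array([True, None], dtype="boolean"),
}


def masked_flags(arrs):
    """Map each name to whether its array is a masked extension array."""
    return {name: isinstance(arr, BaseMaskedArray) for name, arr in arrs.items()}


flags = masked_flags(arrays)
print(flags)  # {'Int64': True, 'Float64': True, 'boolean': True}

# Each masked array carries a boolean mask that is True at missing slots.
masks = {name: arr._mask.tolist() for name, arr in arrays.items()}
print(masks["Int64"])  # [False, True]

# Other extension arrays, such as Categorical, are not masked arrays.
print(isinstance(pd.Categorical(["a"]), BaseMaskedArray))  # False
```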

pep8speaks · 2020-12-11T06:15:24Z

Hello @tushushu! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-19 15:29:07 UTC

arw2019 · 2020-12-11T07:50:28Z

pandas/core/algorithms.py

@@ -467,6 +467,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
    elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
        values.dtype
    ):
+        if type(comps).__name__ == "IntegerArray":


You want to check isinstance(comps, numpy.ma.MaskedArray) I think

Maybe not the numpy function but check for the masked EA class

So we could use

if isinstance(comps, BaseMaskedArray):

here.

jbrockmendel · 2020-12-11T21:33:51Z

pandas/core/algorithms.py

@@ -467,6 +467,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
    elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
        values.dtype
    ):
+        if type(comps).__name__ == "IntegerArray":
+            comps = comps._data  # type: ignore[attr-defined, assignment]


This is going to lead to incorrect answers, e.g.

    import pandas as pd
    from pandas.core.algorithms import isin

    arr = pd.array([2, 3, pd.NA, 5])

    >>> isin(arr, [1])
    array([False, False, False, False])
    >>> isin(arr._data, [1])
    array([False, False, True, False])

Hi @jbrockmendel, thanks for pointing this out. If there is a pd.NA in the array, the result can be incorrect, but we can multiply the result by ~comps._mask, where comps._mask is a boolean NumPy array that is True at the pd.NA positions:

    def isin_for_masked_array(comps, values):
        if isinstance(comps, BaseMaskedArray):
            result = isin(comps._data, values) * np.invert(comps._mask)
            return result
        return isin(comps, values)

I've tested this: the result is correct and the runtime is not bad.

However, if the case is isin([2, 3, pd.NA, 5], [1, pd.NA]), the solution above is not correct. We have to check whether the second array contains a null value and, if so, add comps._mask so that the null positions in the first array become True. The solution could be:

    def isin_for_masked_array2(comps, values):
        if isinstance(comps, BaseMaskedArray):
            result = isin(comps._data, values) * np.invert(comps._mask)
            if any(x is pd.NA for x in values):
                result += comps._mask
            return result
        return isin(comps, values)

We have to be careful when the second array contains 1, because a masked array's NA positions hold the value 1 in self._data.
The solution below could be wrong:

    def wrong_solution(comps, values):
        if isinstance(comps, BaseMaskedArray):
            result = isin(comps._data, values)
            if any(x is pd.NA for x in values):
                pass
            else:
                result *= np.invert(comps._mask)
            return result
        return isin(comps, values)
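The mask arithmetic discussed in this thread can be tried out with plain NumPy arrays standing in for `_data` and `_mask`. This is only a sketch under that assumption: `masked_isin` and its `values_has_na` flag are hypothetical names, and it uses `&` and `|` where the snippets above use `*` and `+=` on boolean arrays.

```python
import numpy as np


def masked_isin(data, mask, values, values_has_na):
    """Hypothetical helper: membership test for a (data, mask) pair."""
    # Look up the raw values, then blank out masked slots so that the
    # placeholder stored under the mask can never produce a match.
    result = np.isin(data, values) & ~mask
    if values_has_na:
        # A missing element matches only when values itself contains NA.
        result = result | mask
    return result


data = np.array([2, 3, 1, 5])  # slot 2 is masked; the 1 is a placeholder
mask = np.array([False, False, True, False])

print(np.isin(data, [1]))                   # [False False  True False] (naive)
print(masked_isin(data, mask, [1], False))  # [False False False False]
print(masked_isin(data, mask, [1], True))   # [False False  True False]
```

Clearing the masked slots first and only then adding the mask back is what keeps the placeholder 1 from leaking into the result when `values` contains both 1 and NA.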

We have to check if there is null value in the second array

Have to be a little bit careful about which null values you allow, e.g. if values contains pd.NaT and comps is an IntegerArray, we probably don't want to count it.

As for whether to set NA locations to False or NA, best to check with @jorisvandenbossche for his thoughts on the desired behavior.

On the code organization, #38422 is a sketch of where I suggest you put the implementation.
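To illustrate the point about which null values to allow, the sketch below contrasts an identity check against `pd.NA` with the broader `pd.isna`. The helper name `contains_pd_na` is hypothetical, and whether identity is the right rule for the final implementation is left open here.

```python
import numpy as np
import pandas as pd


def contains_pd_na(values) -> bool:
    """Hypothetical helper: True only if values holds the pd.NA singleton."""
    # Identity comparison is used on purpose: pd.isna() would also accept
    # None, np.nan and pd.NaT, which is broader than wanted here.
    return any(x is pd.NA for x in values)


print(contains_pd_na([1, pd.NA]))   # True
print(contains_pd_na([1, pd.NaT]))  # False
print(contains_pd_na([1, np.nan]))  # False
print(pd.isna(pd.NaT))              # True
```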

tushushu · 2020-12-12T10:53:22Z

Here is the Jupyter notebook showing how I analyzed and fixed this issue.
https://github.com/tushushu/tuopen/blob/main/pandas-pr-38379.ipynb

jreback · 2020-12-13T17:50:28Z

@tushushu suggest you follow what is going on in #38422 for this.

tushushu · 2020-12-14T02:39:04Z

Thanks @jbrockmendel and @jreback , I am going to have a look at #38422

pandas/core/arrays/masked.py

tushushu · 2020-12-27T12:16:47Z

@jbrockmendel @jorisvandenbossche Hi, I guess isin([1, 2, pd.NA], [1, pd.NA]) could return [True, False, True] rather than [True, False, pd.NA].
Below are some examples that might be useful. I am also curious why pd.NA in pd.Series([1, 2, pd.NA]) should return False?

None in [1, 2, 3, None]  # Will return True but not None
np.nan in np.array([1, 2, np.nan])  # Will return False but not np.nan
pd.NA in pd.Series([1, 2, pd.NA])  # Will return False but not pd.NA

Could you please give me some advice? Thanks~

MarcoGorelli · 2020-12-27T12:56:31Z

I am also curious why pd.NA in pd.Series([1, 2, pd.NA]) should return False?

I think this is what I'd expect - if pd.NA could be anything, then you don't know if it's in pd.Series([1, 2, pd.NA])
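A small sketch of the semantics under discussion may help. It shows that equality with `pd.NA` yields `pd.NA`, that `np.nan` never equals itself, and, as a separate point that may also matter for the `in` examples above, that `in` on a Series tests the index labels and not the values.

```python
import numpy as np
import pandas as pd

# Equality with pd.NA is "unknown", so the result is pd.NA, not a bool.
print(pd.NA == 1)                 # <NA>
print((pd.NA == pd.NA) is pd.NA)  # True

# np.nan never compares equal to itself, so array membership fails.
print(np.nan in np.array([1.0, 2.0, np.nan]))  # False

# `in` on a Series tests the index labels, not the values.
s = pd.Series([10, 20, pd.NA], dtype="Int64")
print(1 in s)   # True: 1 is an index label
print(10 in s)  # False: 10 is a value, not a label
```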

jbrockmendel · 2020-12-29T05:24:07Z

gentle ping @jorisvandenbossche can you weigh in here on desired behavior

jbrockmendel · 2020-12-29T05:24:48Z

pandas/core/arrays/masked.py

@@ -18,15 +18,17 @@
 )
 from pandas.core.dtypes.missing import isna, notna

+import pandas as pd


can you avoid this import


jreback · 2021-01-20T01:44:53Z

thanks @tushushu very nice!

sometimes takes a while to get it over the line, but done!

jbrockmendel · 2021-01-20T03:00:35Z

could use a follow-up with tests

jorisvandenbossche · 2021-01-20T07:35:02Z

Indeed, and some other follow-ups from my comment above #38379 as well

tushushu · 2021-01-21T13:41:01Z

Thanks @jreback @jbrockmendel @jorisvandenbossche @arw2019, this is my first contribution to pandas. I really learnt a lot from you.

simonjayhawkins · 2021-02-01T11:29:59Z

asv_bench/benchmarks/series_methods.py

@@ -141,7 +162,7 @@ def time_isin(self, dtypes, MaxNumber, series_type):

 class IsInLongSeriesValuesDominate:
    params = [
-        ["int64", "int32", "float64", "float32", "object"],
+        ["int64", "int32", "float64", "float32", "object", "Int64", "Float64"],


This is causing failures with numpy 1.20

Int64 and Float64 now raise TypeError

https://github.com/pandas-dev/pandas/runs/1802727214

fix series.isin slow issue with Dtype IntegerArray

109c0e7

tushushu mentioned this pull request Dec 9, 2020

ENH: implement fast isin() for nullable dtypes #38340

Closed

jreback requested changes Dec 9, 2020


tushushu added 2 commits December 9, 2020 15:09

Move isinstance(comps, IntegerArray) to algo.isin

e9f96ea

cannot import IntegerArray due to circular import

a6be9c8

tushushu added 3 commits December 9, 2020 18:08

fix bug in pandas (Linux py38_np_dev)

415b590

fix pre commit issue.

f3e5afb

fix the code style issue.

14579fc

arw2019 reviewed Dec 9, 2020


jreback added ExtensionArray Indexing Performance labels Dec 10, 2020

move the logic to elif block.

562c918

remove blank line.

1449d3c

arw2019 reviewed Dec 11, 2020


jbrockmendel reviewed Dec 11, 2020


jbrockmendel mentioned this pull request Dec 20, 2020

ENH/POC: EA.isin #38422

Closed

5 tasks

tushushu added 2 commits December 27, 2020 16:17

copy codes from pandas-dev#38422

3ccc917

make isin correct for pd.NA

98a0683

tushushu commented Dec 27, 2020


sort imports

6e2917e

jbrockmendel reviewed Dec 29, 2020


tushushu added 11 commits January 14, 2021 23:12

Merge remote-tracking branch 'upstream/master' into ENH-implement-fas…

2279519

…t-isin To fix numpy_dev server errors.

makes NA isin [NA] return True.

c726d4a

remove redundant codes.

1134ad6

makes performance better.

bce0e3e

fix flake8 errors.

bf788e5

polish codes

40950be

not import NA

570d640

fix code style

0f89578

fix black error.

199c11c

fix CI

9f35b5b

Merge remote-tracking branch 'upstream/master' into ENH-implement-fas…

2238dc5

…t-isin Rebase.

jreback approved these changes Jan 20, 2021


jreback merged commit 5b15515 into pandas-dev:master Jan 20, 2021

nofarm3 pushed a commit to nofarm3/pandas that referenced this pull request Jan 21, 2021

fix series.isin slow issue with Dtype IntegerArray (pandas-dev#38379)

85e4004

simonjayhawkins reviewed Feb 1, 2021


This was referenced Feb 1, 2021

CI: pin numpy for CI / Checks github action #39526

Merged

CI: unpin numpy for CI / Checks github action #36092

Merged

This was referenced Feb 13, 2021

fix benchmark failure with numpy 1.20+ #39795

Merged

CI: series_methods.IsInLongSeriesLookUpDominates.time_isin fails with NumPy 1.20+ #39844

Closed

Backport PR #39795 on branch 1.2.x (fix benchmark failure with numpy 1.20+) #39842

Closed

This was referenced Jul 9, 2021

BUG: isin() with missing values does not work in 1.3.0 with extension dtypes #42405

Closed

API: ExtensionArray.isin treatment of missing values #42545

Closed

phofl mentioned this pull request Mar 11, 2022

BUG: isin propagates nulls for DataFrame but not Series #46326

Open

3 tasks
