BUG: isin() with missing values does not work in 1.3.0 with extension dtypes #42405

zmeves · 2021-07-06T15:16:48Z

This bug is not present in Pandas < 1.3.0.

In 1.3.0, calling Series.isin() fails if both of the following are true:

- the Series dtype is an extension dtype (pd.Float64Dtype(), pd.Int64Dtype(), ...)
- the Series contains any missing values (numpy.nan, pd.NA)

The following code snippet tests a few dtypes, determining if each of them supports isin with missing values:

import pandas as pd
import numpy as np

for dtype in (float, int, pd.Float64Dtype(), pd.Int64Dtype(), object):

    x = pd.Series([0, 1, 2, 3, 4], dtype=dtype)
    options = [1, 2, 3]

    print(f"\nTesting with dtype = {x.dtype}:")

    x.isin(options)  # This works every time - no missing values

    x.iloc[1] = np.nan  # Set a value to NA

    try:
        x.isin(options)  # This no longer works
    except Exception as err:
        print(f"Error! {err}")
    else:
        print("OK")

# Now, show the actual stack trace
print("\nStacktrace for dtype=Int64")
dtype = pd.Int64Dtype()
x = pd.Series([0, 1, 2, 3, 4], dtype=dtype)
options = [1, 2, 3]
x.iloc[1] = np.nan  # Set a value to NA
x.isin(options)

The output is:

Testing with dtype = float64:
OK

Testing with dtype = int64:
OK

Testing with dtype = Float64:
Error! boolean value of NA is ambiguous

Testing with dtype = Int64:
Error! boolean value of NA is ambiguous

Testing with dtype = object:
OK

Stacktrace for dtype=Int64
Traceback (most recent call last):
  File "...dev/pd_1_3_isin_bug.py", line 31, in <module>
    x.isin(options)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/series.py", line 5024, in isin
    result = algorithms.isin(self._values, values)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/algorithms.py", line 475, in isin
    return comps.isin(values)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/arrays/masked.py", line 408, in isin
    if libmissing.NA in values:
  File "pandas/_libs/missing.pyx", line 446, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
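
The last two frames suggest the failure comes from the membership test itself rather than from the Series data. A minimal sketch of that step, assuming `values` is the plain list passed to isin(): Python's `in` compares each list element with `==`, `pd.NA == 1` evaluates to `pd.NA`, and converting that result to a bool raises. The identity-based check at the end is only an illustration of a test that avoids the conversion, not necessarily what the eventual fix does.

```python
import pandas as pd

options = [1, 2, 3]

# `pd.NA in options` compares pd.NA with each element using ==.
# Each comparison returns pd.NA, and bool(pd.NA) raises TypeError.
try:
    pd.NA in options
except TypeError as err:
    raised = True
    message = str(err)
else:
    raised = False
    message = ""

print(raised, message)

# Illustrative alternative: an identity check never calls bool() on pd.NA.
has_na = any(v is pd.NA for v in options)
print(has_na)
```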

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1127.13.1.el7.x86_64
Version          : #1 SMP Fri Jun 12 14:34:17 EDT 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.20.3
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 21.0.1
setuptools       : 40.8.0
Cython           : 0.29.13
pytest           : 5.1.1
hypothesis       : None
sphinx           : 4.0.2
blosc            : None
feather          : None
xlsxwriter       : 1.2.1
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.8.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : 1.2.1
fsspec           : 0.5.2
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : 2.7.0
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : 0.13.0
pyxlsb           : None
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : 1.3.9
tables           : 3.5.2
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
numba            : 0.45.1

mzeitlin11 · 2021-07-09T20:19:28Z

Thanks for reporting this regression @zmeves! Seems to be caused by #38379, will take a look.

simonjayhawkins · 2021-07-14T13:25:49Z

> Seems to be caused by #38379

can confirm

first bad commit: [5b15515] fix series.isin slow issue with Dtype IntegerArray (#38379)

mzeitlin11 added the isin, NA - MaskedArrays, and Regression labels Jul 9, 2021

mzeitlin11 self-assigned this Jul 9, 2021

mzeitlin11 added this to the 1.3.1 milestone Jul 9, 2021

mzeitlin11 mentioned this issue Jul 9, 2021: REGR: isin with nullable types with missing values raising #42473 (merged)

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 14, 2021

code sample for pandas-dev#42405

2cb9559

jreback closed this as completed in #42473 Jul 14, 2021
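
For anyone who cannot yet move off 1.3.0, one possible workaround is to bypass the masked-array isin() path entirely. The sketch below is an assumption, not something proposed in this thread: `isin_nullable` is a hypothetical helper that converts the nullable column to a float64 NumPy array, runs `np.isin`, and forces missing rows to False. It assumes the values fit in float64 without loss of precision and that none of the lookup values is itself a missing value.

```python
import numpy as np
import pandas as pd


def isin_nullable(series, values):
    """Hypothetical helper: membership test that treats missing rows as False."""
    # Boolean NumPy mask of missing entries.
    missing = series.isna().to_numpy()
    # Convert to plain floats, replacing pd.NA with NaN (may lose precision
    # for very large integers).
    data = series.to_numpy(dtype="float64", na_value=np.nan)
    result = np.isin(data, list(values)) & ~missing
    return pd.Series(result, index=series.index)


x = pd.Series([0, None, 2, 3, 4], dtype=pd.Int64Dtype())
print(isin_nullable(x, [1, 2, 3]))
```

The result is a plain bool Series rather than the nullable boolean dtype, which may or may not matter for downstream code.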
