BUG: isin() with missing values does not work in 1.3.0 with extension dtypes #42405

Closed
zmeves opened this issue Jul 6, 2021 · 2 comments · Fixed by #42473
Labels: isin (isin method), NA - MaskedArrays (related to pd.NA and nullable extension arrays), Regression (functionality that used to work in a prior pandas version)

zmeves commented Jul 6, 2021

This bug is not present in pandas < 1.3.0.

In 1.3.0, calling Series.isin() raises a TypeError if both of the following hold:

  • the Series has an extension dtype (pd.Float64Dtype(), pd.Int64Dtype(), ...)
  • the Series contains any missing values (numpy.nan, pd.NA)

The following snippet tests a few dtypes to determine whether each one supports isin() with missing values:

import pandas as pd
import numpy as np

for dtype in (float, int, pd.Float64Dtype(), pd.Int64Dtype(), object):

    x = pd.Series([0, 1, 2, 3, 4], dtype=dtype)
    options = [1, 2, 3]

    print(f"\nTesting with dtype = {x.dtype}:")

    x.isin(options)  # Works every time: no missing values yet

    x.iloc[1] = np.nan  # Set a value to NA

    try:
        x.isin(options)  # Raises in 1.3.0 for extension dtypes
    except Exception as err:
        print(f"Error! {err}")
    else:
        print("OK")

# Now, show the actual stack trace
print("\nStacktrace for dtype=Int64")
dtype = pd.Int64Dtype()
x = pd.Series([0, 1, 2, 3, 4], dtype=dtype)
options = [1, 2, 3]
x.iloc[1] = np.nan  # Set a value to NA
x.isin(options)

The output is:

Testing with dtype = float64:
OK

Testing with dtype = int64:
OK

Testing with dtype = Float64:
Error! boolean value of NA is ambiguous

Testing with dtype = Int64:
Error! boolean value of NA is ambiguous

Testing with dtype = object:
OK

Stacktrace for dtype=Int64
Traceback (most recent call last):
  File "...dev/pd_1_3_isin_bug.py", line 31, in <module>
    x.isin(options)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/series.py", line 5024, in isin
    result = algorithms.isin(self._values, values)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/algorithms.py", line 475, in isin
    return comps.isin(values)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/arrays/masked.py", line 408, in isin
    if libmissing.NA in values:
  File "pandas/_libs/missing.pyx", line 446, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
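
The last two frames of the traceback point at the membership test `libmissing.NA in values`. The sketch below reproduces that failure using only the public `pd.NA` and a plain list, and shows an identity-based check that avoids it. The helper name `contains_na` is hypothetical, and this is not necessarily how the fix in #42473 is implemented.

```python
import pandas as pd

options = [1, 2, 3]

# `in` on a list compares by identity first and then falls back to `==`.
# `pd.NA == 1` evaluates to pd.NA, whose truth value is undefined, so the
# membership test itself raises instead of returning False.
try:
    pd.NA in options
    membership_error = None
except TypeError as err:
    membership_error = str(err)

print(membership_error)


def contains_na(values):
    """Hypothetical helper: detect pd.NA without ever calling NA.__bool__."""
    return any(v is pd.NA for v in values)


print(contains_na(options))
print(contains_na([1, pd.NA]))
```

Because the identity comparison never evaluates `pd.NA == element`, it cannot hit the ambiguous truth value seen in the traceback.
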

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1127.13.1.el7.x86_64
Version          : #1 SMP Fri Jun 12 14:34:17 EDT 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.20.3
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 21.0.1
setuptools       : 40.8.0
Cython           : 0.29.13
pytest           : 5.1.1
hypothesis       : None
sphinx           : 4.0.2
blosc            : None
feather          : None
xlsxwriter       : 1.2.1
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.8.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : 1.2.1
fsspec           : 0.5.2
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : 2.7.0
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : 0.13.0
pyxlsb           : None
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : 1.3.9
tables           : 3.5.2
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
numba            : 0.45.1
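
For anyone stuck on 1.3.0, one possible workaround is to make sure the Series passed to isin() contains no missing values and to mask those positions out afterwards. This is a minimal sketch, not something proposed in this thread: the helper name `isin_skipna` is hypothetical, it assumes `options` itself contains no missing values, and it treats a missing entry as "not in options".

```python
import pandas as pd


def isin_skipna(series, options):
    """Hypothetical helper: isin() that treats missing entries as non-matches."""
    options = list(options)
    if not options:
        # Nothing can match an empty option list.
        return pd.Series(False, index=series.index)
    # Fill missing entries with a value known to be in `options`, so the
    # Series handed to isin() has no NA, then mask those positions back out.
    filled = series.fillna(options[0])
    return filled.isin(options) & series.notna()


x = pd.Series([0, 1, 2, 3, 4], dtype=pd.Int64Dtype())
x.iloc[1] = pd.NA  # Set a value to NA

result = isin_skipna(x, [1, 2, 3])
print(result.tolist())
```

The position holding the missing value should come out as False, and the remaining positions follow ordinary isin() semantics.
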

mzeitlin11 (Member) commented

Thanks for reporting this regression @zmeves! Seems to be caused by #38379, will take a look.

mzeitlin11 added the isin, NA - MaskedArrays, and Regression labels Jul 9, 2021
mzeitlin11 self-assigned this Jul 9, 2021
mzeitlin11 added this to the 1.3.1 milestone Jul 9, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 14, 2021

simonjayhawkins (Member) commented

> Seems to be caused by #38379

Can confirm.

First bad commit: [5b15515] fix series.isin slow issue with Dtype IntegerArray (#38379)
