BUG: Index containing NA behaves absolutely unpredictably when length exceeds 128 #58924

avm19 · 2024-06-04T20:47:27Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# OK:
n, val = 127, pd.NA
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)

# Still OK:
n, val = 128, 128
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)

# But this FAILS:
n, val = 128, pd.NA
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)  # ValueError: 'indices' contains values less than allowed (-128 < -1)
# Expected no error

Workaround: to filter out elements, use boolean-mask indexing instead of s.drop():

s[~s.index.isin([0])]
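The workaround can be wrapped in a small helper. This is only a sketch: the name drop_labels_by_mask is hypothetical (not a pandas API), and it assumes that boolean-mask selection does not go through the index lookup path that s.drop() uses.

```python
import pandas as pd


def drop_labels_by_mask(s: pd.Series, labels) -> pd.Series:
    """Hypothetical helper: remove rows whose index label is in `labels`.

    Uses Index.isin() plus boolean-mask selection instead of Series.drop().
    """
    return s[~s.index.isin(labels)]


# Same construction as the failing case in the report: 128 integers plus one NA.
n = 128
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([pd.NA], dtype="Int64"))
s = pd.Series(index=idx, data=range(n + 1), dtype="Int64")

out = drop_labels_by_mask(s, [0])
print(len(s), len(out))
```

Unlike s.drop(), this does not raise when a label is missing from the index; it simply leaves the Series unchanged.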

Issue Description

When an Index contains NA and its length exceeds 128, lookups that involve the NA entry return wrong positions, so operations such as s.drop() fail or produce incorrect results.

This bug can be narrowed down to IndexEngine.get_indexer() or MaskedIndexEngine.get_indexer(), as these examples suggest:

axis = pd.Index(range(250), dtype='Int64').union(pd.Index([pd.NA], dtype='Int64'))
new_axis = axis.drop(0)
axis.get_indexer(new_axis)[-5:] # array([246, 247, 248, 249,  -6])
# Expected array([246, 247, 248, 249,  250])

axis = pd.Index(range(254), dtype='Int64').union(pd.Index([pd.NA], dtype='Int64'))
new_axis = axis.drop(0)
axis.get_indexer(new_axis)[-5:] # array([250, 251, 252, 253,  -2])
# Expected  array([250, 251, 252, 253,  254])

These examples further suggest that the root cause lies in how NA is represented in, and interacts with, the hash tables that Index uses for its _engine.

Expected Behavior

See above

Installed Versions

commit : 76c7274
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.216-204.855.amzn2.x86_64
Version : #1 SMP Sat May 4 16:53:27 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1067.g76c7274985
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None


avm19 · 2025-02-06T20:40:08Z

take

avm19 · 2025-02-07T05:32:51Z

The bug is likely due to this line:

pandas/pandas/_libs/hashtable_class_helper.pxi.in

Line 538 in 3979e95

int8_t na_position = self.na_position

Should be ~~{{c_type}} na_position = self.na_position~~. Will confirm soon.

Update. Should be Py_ssize_t na_position = self.na_position.
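The numbers reported above are consistent with a position being squeezed into a signed 8-bit integer. The sketch below only illustrates that arithmetic with the standard-library ctypes module; it does not touch pandas internals, and the helper name as_int8 is made up for this illustration.

```python
import ctypes


def as_int8(position: int) -> int:
    """Reinterpret an integer as a signed 8-bit value (wraps modulo 256)."""
    return ctypes.c_int8(position).value


# Positions of the NA entry in the examples from this issue:
#   127 -> index of length 128 (works)
#   128 -> index of length 129 (s.drop(0) raises, "-128 < -1")
#   250, 254 -> the two get_indexer() examples
for pos in (127, 128, 250, 254):
    print(pos, "->", as_int8(pos))
# 127 -> 127
# 128 -> -128
# 250 -> -6
# 254 -> -2
```

The wrapped values match the -128 in the ValueError message and the -6 and -2 returned by get_indexer(), which is why a wider type such as Py_ssize_t would avoid the problem.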

avm19 added the Bug and Needs Triage labels Jun 4, 2024

avm19 changed the title ~~BUG: Index containing NA~~ BUG: Index containing NA behaves absolutely unpredictably when length > 128 Jun 4, 2024

avm19 changed the title ~~BUG: Index containing NA behaves absolutely unpredictably when length > 128~~ BUG: Index containing NA behaves absolutely unpredictably when length exceeds 128 Jun 4, 2024

github-actions bot assigned avm19 Feb 6, 2025

rhshadrach added the Missing-data label and removed the Needs Triage label Feb 12, 2025

avm19 mentioned this issue Mar 5, 2025

BUG: Fix na_position type in IndexEngine #61062

Merged

5 tasks

mroeschke closed this as completed in #61062 Mar 7, 2025

trevorspreadbury mentioned this issue Mar 21, 2025

Major state finance pipeline refactor uchicago-dsi/climate-cabinet-campaign-finance-tracker#127

Merged
