BUG: index duplicates keys with non ascii chars #57942

aquirin · 2024-03-20T23:46:18Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

When creating a DataFrame with a multi-column index containing non-ASCII characters, pandas merges different keys into a single key.

import pandas as pd
car = "é".encode("latin1").decode('utf8', 'surrogateescape')
data = [(1,"a-"+car, "x-"+car), (2, "b-"+car, "y-"+car)]
df = pd.DataFrame(data, columns=["c1", "c2", "c3"]).set_index(["c2", "c3"]).reset_index()
print(list(df["c3"]))

prints the same key twice:

['x-\udce9', 'x-\udce9']

Expected behavior:

['x-\udce9', 'y-\udce9']
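For context, the string `car` built in the example is not an ordinary "é". Decoding the Latin-1 byte 0xE9 as UTF-8 with the `surrogateescape` error handler yields the lone surrogate `\udce9`, which cannot be encoded as strict UTF-8. A minimal sketch in plain Python (no pandas involved; the variable names `strict_encode_ok` and `round_trip` are only for illustration):

```python
# Build the same string as in the example above.
car = "é".encode("latin1").decode("utf8", "surrogateescape")

# The undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
assert car == "\udce9"
assert len(car) == 1

# A lone surrogate cannot be encoded with the default strict error handler ...
try:
    car.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# ... but encoding with surrogateescape restores the original byte.
round_trip = car.encode("utf-8", "surrogateescape")
```

So the keys in this report are non-ASCII in a specific sense: they contain code points that are not valid in well-formed UTF-8 text.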

Note that with ASCII characters, the behavior is correct:

import pandas as pd
car = "0"
data = [(1,"a-"+car, "x-"+car), (2, "b-"+car, "y-"+car)]
df = pd.DataFrame(data, columns=["c1", "c2", "c3"]).set_index(["c2", "c3"]).reset_index()
print(list(df["c3"]))

returns two different keys:

['x-0', 'y-0']

Note that the behavior is also correct with the non-ASCII character when the index uses a single column:

import pandas as pd
car = "é".encode("latin1").decode('utf8', 'surrogateescape')
data = [(1,"a-"+car, "x-"+car), (2, "b-"+car, "y-"+car)]
df = pd.DataFrame(data, columns=["c1", "c2", "c3"]).set_index(["c3"]).reset_index()
print(list(df["c3"]))

returns two different keys:

['x-\udce9', 'y-\udce9']

Issue Description

Creating a MultiIndex from keys that contain non-ASCII characters does not preserve distinct keys. Instead, different keys are merged into one.

Expected Behavior

Creating a MultiIndex with non-ASCII characters should preserve every distinct key.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.12.1.final.0
python-bits           : 64
OS                    : Linux
OS-release            : 3.10.105
Version               : #25556 SMP Sat Aug 28 02:13:34 CST 2021
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : en_US.UTF-8
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : None
pip                   : 24.0
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.22.2
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None


dontgoto · 2024-03-22T22:31:02Z

Interesting. This seems to happen because of the malformed unicode characters. Somewhere in the MultiIndex code path, a search for duplicates in the index (e.g., in ['x-\udce9', 'y-\udce9']) runs through a Cython hash map. This duplicate check does not happen for a single-level index.

In the end, a false-positive duplicate is filtered out, which leads to the ['x-\udce9', 'x-\udce9'] index values in your example.

I don't think changing the hash map implementation is a good idea, but one could at least raise an exception when the user tries to put malformed unicode into a MultiIndex.
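A check along the lines suggested here could look roughly like the following. This is only a sketch: `find_malformed_labels` is a hypothetical helper, not a pandas function, and it assumes that "malformed" means "cannot be encoded as strict UTF-8". Where pandas would run such a check, and whether it should raise, is not settled in this thread.

```python
def find_malformed_labels(labels):
    """Return the string labels that cannot be encoded as strict UTF-8.

    Hypothetical helper for illustration only; not part of pandas.
    Non-string labels are ignored.
    """
    bad = []
    for label in labels:
        if isinstance(label, str):
            try:
                label.encode("utf-8")
            except UnicodeEncodeError:
                # Lone surrogates such as '\udce9' end up here.
                bad.append(label)
    return bad


# Same malformed character as in the report.
car = "é".encode("latin1").decode("utf8", "surrogateescape")
labels = ["a-" + car, "b-" + car, "plain", "é"]

# Only the labels with the lone surrogate are flagged; a real "é" is fine.
bad = find_malformed_labels(labels)
```

A caller could run such a helper on each index column before `set_index` and raise a `ValueError` if the result is non-empty.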

kvnwng11 · 2024-04-02T21:55:29Z

take

aquirin added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Mar 20, 2024

github-actions bot assigned kvnwng11 Apr 2, 2024
