Pandas 0.22.0 does not raise KeyError for misspelled column with .drop_duplicates() #19726

aktivkohle · 2018-02-16T11:58:02Z

I have tested two versions of Pandas side by side with exactly the same code. 0.19.2 behaves as expected, but 0.22.0 does what I am about to describe. I will probably switch to 0.19.2 for now. I am using Python 3.6.4.

import pandas as pd
df = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,81,87], "C":[56,78,0,14,13], "D":[0,87,72,87,14], "E":[78,12,31,0,34]}) 

print(df.drop_duplicates(['b','D']))
print(df.drop_duplicates(['B','D']))
print(df.drop_duplicates(['B']))
print(df.drop_duplicates(['D']))

Problem description

I became aware of the problem working with a much larger dataframe when it failed to warn me or raise a KeyError when I misspelled a column name.
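Until the behaviour is fixed, one possible stopgap is to validate the column names before handing them to `drop_duplicates`. The sketch below is only an illustration: `require_columns` is a hypothetical helper, not part of pandas, and it works on any iterable of column names so that it can be tried without a DataFrame.

```python
def require_columns(columns, subset):
    """Return `subset` as a list, raising KeyError if any label is missing.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # Treat a single string label as a one-element subset.
    labels = [subset] if isinstance(subset, str) else list(subset)
    known = set(columns)
    missing = [label for label in labels if label not in known]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return labels


columns = ["A", "B", "C", "D", "E"]

# A correctly spelled subset passes through unchanged.
print(require_columns(columns, ["B", "D"]))

# The misspelled 'b' is reported instead of being ignored.
try:
    require_columns(columns, ["b", "D"])
except KeyError as err:
    print("KeyError:", err)
```

With a real DataFrame the call would look like `df.drop_duplicates(require_columns(df.columns, ['b', 'D']))`, which should fail loudly on 0.22.0 as well.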

Expected Output

Pandas 0.19.2 gives you the following, but Pandas 0.22.0 gives you no KeyError for the first print statement; it just runs.

KeyError: 'b'

    A   B   C   D   E
0  34  54  56   0  78
1  12  87  78  87  12
2  78  35   0  72  31
3  84  81  14  87   0
4  26  87  13  14  34

    A   B   C   D   E
0  34  54  56   0  78
1  12  87  78  87  12
2  78  35   0  72  31
3  84  81  14  87   0

    A   B   C   D   E
0  34  54  56   0  78
1  12  87  78  87  12
2  78  35   0  72  31
4  26  87  13  14  34

Output of `pd.show_versions()`

Below is the output for the Pandas version where the problem occurs.

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-02-16T12:09:50Z

Thanks for the report. The root cause looks like it's in DataFrame.duplicated.

Interested in submitting a fix?

TomAugspurger · 2018-02-16T12:11:41Z

Whoever fixes this may be able to knock out #12869 at the same time.

aktivkohle · 2018-02-16T13:23:54Z

@TomAugspurger I might have a crack at it, unless someone beats me to it. I had a quick look with PyCharm but have not managed to locate that variable / class yet; it must be in there somewhere.

TomAugspurger · 2018-02-16T13:29:22Z

Great! I think the offending lines are

pandas/pandas/core/frame.py

Lines 3658 to 3660 in 2fdf1e2

    vals = (col.values for name, col in self.iteritems()
            if name in subset)
    labels, shape = map(list, zip(*map(f, vals)))

Making some bad assumptions about subset. LMK if you have any troubles.
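To make the "bad assumptions" concrete, here is a small sketch of the same filtering pattern. It uses a plain dict as a stand-in for the DataFrame's columns (an assumption for illustration only), so it shows the shape of the problem rather than the actual pandas code path or the eventual fix.

```python
# A plain dict stands in for the DataFrame's columns (illustration only).
frame = {
    "A": [34, 12, 78, 84, 26],
    "B": [54, 87, 35, 81, 87],
    "D": [0, 87, 72, 87, 14],
}
subset = ["b", "D"]  # 'b' is misspelled

# Same shape as the quoted generator: walk the existing columns and keep
# those named in subset. A label that matches nothing simply drops out.
selected = [name for name, col in frame.items() if name in subset]
print(selected)  # ['D']

# Checking the requested labels against the columns exposes the typo.
missing = [name for name in subset if name not in frame]
print(missing)  # ['b']

# Looking each requested label up directly would raise instead.
try:
    vals = [frame[name] for name in subset]
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'b'
```

Because the filter runs over the existing columns rather than over `subset`, the misspelled `'b'` never gets looked up, so the call would appear to behave like `drop_duplicates(['D'])`.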

NoahTheDuke · 2018-02-16T17:43:38Z

I've got a working fix. Mind if I submit the pull request? Or did you want to take a shot at it, @aktivkohle?

aktivkohle · 2018-02-16T17:47:19Z

@NoahTheDuke Yes do it, thanks for asking.. I'm sure I'll get my chance one day, and would probably take hours to work it out on the weekend which is not meant for that kind of thing. Will have a look at how you did it of course..

TomAugspurger added the Regression (functionality that used to work in a prior pandas version), Effort Low, and good first issue labels Feb 16, 2018

TomAugspurger added this to the Next Major Release milestone Feb 16, 2018

NoahTheDuke mentioned this issue Feb 16, 2018

BUG: drop_duplicates not raising KeyError on missing key #19730

Merged


jreback modified the milestones: Next Major Release, 0.23.0 Feb 18, 2018

jreback closed this as completed in #19730 Feb 21, 2018
