BUG: duplicated() on an empty DataFrame or a DataFrame with an empty subset of columns with a non-empty index #12869
Comments
hmm, that is not very friendly. care to submit a pull-request to fix? |
Just to double-check, what's the expected result here? As above,
on the grounds that we're comparing the empty row to itself? |
Looks to work on master. Could use a test.
|
Hi, @mroeschke just to make sure we're on the same page, for the following snippet:
would the following output be correct?
|
@grassknoted I just came across this by coincidence. You might also consider testing with an empty dataframe to verify there is no ValueError. It looks like the column subsetting is not a necessary step to reproduce the problem in older pandas versions.
    $ python
|
Hi @grassknoted, according to the Pandas' docs,

    import pandas as pd

    data_frame = pd.DataFrame({'test_column': [1]*5})
    cols = ['test_column']
    print(data_frame[cols].shape[0])
    print(data_frame[cols].duplicated().size)
    print('***')
    cols = []
    print(data_frame[cols].shape[0])
    print(data_frame[cols].duplicated().size)

the output is:

    5
    5
    ***
    5
    0

and this is a bit inconsistent. So, for me the proper result for

    0 False
    1 True
    2 True
    3 True
    4 True
    dtype: bool |
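The result proposed above can be reproduced with a small pure-Python model of row-wise duplicate marking. This is only a sketch of the proposed semantics, not pandas' implementation: `mark_duplicates` is a hypothetical helper, rows are modelled as plain dicts, and "keep the first occurrence" behaviour is assumed.

```python
def mark_duplicates(rows, subset):
    """Flag each row whose values in `subset` already appeared in an earlier row.

    Hypothetical helper that models the proposed semantics; not pandas code.
    """
    seen = set()
    flags = []
    for row in rows:
        # With an empty subset every row projects to the same empty tuple,
        # so every row after the first counts as a duplicate.
        key = tuple(row[col] for col in subset)
        flags.append(key in seen)
        seen.add(key)
    return flags


rows = [{"test_column": 1} for _ in range(5)]

full = mark_duplicates(rows, ["test_column"])
empty_subset = mark_duplicates(rows, [])

print(full)          # [False, True, True, True, True]
print(empty_subset)  # [False, True, True, True, True]
```

In this model the result always has one flag per row, so its length matches the number of rows regardless of how many columns are selected.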
I tried reproducing the issue:
To me, this still looks like a bug, and needs fixing, not just more tests. Could you please confirm? |
Yes, from my point of view that should be considered a bug. |
Hi @sebov, looking at the trace, I can see that the discrepancy is caused by the following lines:
With the above changes to the
Please let me know if this is a possible fix, and how else I should test this, thanks! |
Feel free to submit a pull request for a full review @grassknoted! |
@mroeschke, I'm new to pandas and was just trying to find my way around. Could you please point me in the right direction to fix this issue? |
Hi @grassknoted, can you also check if your bugfix works for

    cols = []
    data_frame.duplicated(subset=cols)

I would expect the above to be more or less equivalent to

    cols = []
    data_frame[cols].duplicated()

What do you think? |
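One way to phrase this expectation is as a property check: marking duplicates with a `subset` argument should agree with slicing to those columns first. The sketch below states that property in plain Python under assumed semantics. `project`, `mark_duplicates`, and `subset_matches_slicing` are hypothetical names, and nothing here claims that any particular pandas version satisfies the property for an empty list.

```python
def project(rows, cols):
    """Keep only `cols` in every row, similar to slicing by column labels."""
    return [{col: row[col] for col in cols} for row in rows]


def mark_duplicates(rows, subset=None):
    """Flag rows already seen earlier; `subset=None` compares all columns."""
    seen = set()
    flags = []
    for row in rows:
        cols = list(row) if subset is None else subset
        key = tuple(row[col] for col in cols)
        flags.append(key in seen)
        seen.add(key)
    return flags


def subset_matches_slicing(rows, cols):
    """True when both routes produce the same flags for `cols`."""
    return mark_duplicates(rows, subset=cols) == mark_duplicates(project(rows, cols))


rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "y"},
    {"a": 2, "b": "x"},
    {"a": 1, "b": "x"},
]

for cols in ([], ["a"], ["b"], ["a", "b"]):
    print(cols, subset_matches_slicing(rows, cols))
```

A check of this shape, written against the real `duplicated` method, could serve as a regression test for the empty-list case once the expected behaviour is agreed.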
Thanks for the input, @sebov! So, my bugfix was failing for
With this change in the code, the output is as follows:
Thanks for pointing that out, please do let me know if there are any other tests I should run. |
@sebov, any updates on this issue? |
I'm not sure why the expected behavior is expected for the last one. data_frame[cols] returns an empty DataFrame and then you ask .duplicated() to find duplicates in that empty DataFrame. Returning an empty Series seems like what should happen. How could it possibly return the expected result? Why wouldn't you expect .sum() and .duplicated() to both return an empty Series? |
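One reading of the disagreement, sketched in plain Python rather than pandas: a reduction such as a sum yields one value per column, while duplicate marking yields one value per row. Under that assumption, a frame with five row labels and no columns gives an empty sum but five flags. `column_sums` and `row_duplicate_flags` are hypothetical names. The sketch models the expectation argued earlier in the thread, not what any pandas version actually returns.

```python
index = [0, 1, 2, 3, 4]  # row labels are still present after the column slice
columns = {}             # column name -> values; nothing left when cols = []


def column_sums(columns):
    """Column-wise reduction: one entry per column."""
    return {name: sum(values) for name, values in columns.items()}


def row_duplicate_flags(index, columns):
    """Row-wise marking: one entry per row label."""
    seen = set()
    flags = {}
    for position, label in enumerate(index):
        # With no columns, every row is represented by the empty tuple.
        key = tuple(values[position] for values in columns.values())
        flags[label] = key in seen
        seen.add(key)
    return flags


sums = column_sums(columns)
flags = row_duplicate_flags(index, columns)

print(len(sums))   # 0
print(len(flags))  # 5
```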
While investigating different subsets of a data frame's columns, we run into trouble when the 'duplicated' method is invoked on a data frame sliced to an empty subset of columns.
Code Sample, a copy-pastable example if possible
Expected Output
output of pd.show_versions()