BUG: duplicated() on a empty DataFrame or a DataFrame with an empty subset of columns with a non-empty index #12869

Open
sebov opened this issue Apr 11, 2016 · 15 comments
Labels
Bug; duplicated (duplicated, drop_duplicates); Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@sebov

sebov commented Apr 11, 2016

While investigating different subsets of a data frame's columns, we run into trouble when the 'duplicated' method is invoked on a data frame sliced to an empty subset of columns.

ValueError                                Traceback (most recent call last)
<ipython-input-672-9c619ea6d0ef> in <module>()
     14 print data_frame[cols].sum()
     15 print "---"
---> 16 print data_frame[cols].duplicated()
     17 
     18 

.../local/lib/python2.7/site-packages/pandas/util/decorators.pyc in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

.../local/lib/python2.7/site-packages/pandas/core/frame.pyc in duplicated(self, subset, keep)
   3100 
   3101         vals = (self[col].values for col in subset)
-> 3102         labels, shape = map(list, zip(*map(f, vals)))
   3103 
   3104         ids = get_group_index(labels, shape, sort=False, xnull=False)

ValueError: need more than 0 values to unpack
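The failing line unpacks the result of zip(*...) into exactly two names, so an empty set of columns leaves nothing to unpack. A minimal plain-Python sketch of that failure follows. Note that factorize_stub is a hypothetical stand-in for the internal helper f shown in the traceback, not pandas code, and that the exact wording of the error differs between Python 2 and Python 3.

```python
def factorize_stub(values):
    """Hypothetical stand-in for the helper `f`: returns (codes, number of uniques)."""
    uniques = sorted(set(values))
    codes = [uniques.index(v) for v in values]
    return codes, len(uniques)


def unpack_labels(columns):
    """Mirror the shape of `labels, shape = map(list, zip(*map(f, vals)))`."""
    labels, shape = map(list, zip(*map(factorize_stub, columns)))
    return labels, shape


# One column: zip(*...) yields two tuples, so the two-name unpacking succeeds.
one_col = unpack_labels([[1, 1, 1, 1, 1]])

# No columns: zip() yields nothing, so there are 0 values for 2 names.
try:
    unpack_labels([])
    empty_error = None
except ValueError as exc:
    empty_error = exc
```

Under these assumptions, any fix has to decide what to return before this unpacking is reached whenever there are no columns to compare.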

Code Sample, a copy-pastable example if possible

import pandas as pd
data_frame = pd.DataFrame({'a': [1]*5})
cols = ['a']
print data_frame[cols]
print "---"
print data_frame[cols].sum()
print "---"
print data_frame[cols].duplicated()
print "---"
cols = []
print data_frame[cols]
print "---"
print data_frame[cols].sum()
print "---"
print data_frame[cols].duplicated()

Expected Output

   a
0  1
1  1
2  1
3  1
4  1

---
a    5
dtype: int64

---
0    False
1     True
2     True
3     True
4     True
dtype: bool

---
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

---
Series([], dtype: float64)

---
0    False
1     True
2     True
3     True
4     True
dtype: bool

output of pd.show_versions()

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 1.5.6
setuptools: 12.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.3
sphinx: None
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.6
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
sqlalchemy: None
pymysql: 0.6.6.None
psycopg2: None
jinja2: 2.8
boto: None
@jreback
Contributor

jreback commented Apr 11, 2016

hmm, that is not very friendly.

care to submit a pull-request to fix?

jreback added the Bug, Missing-data, and Difficulty Intermediate labels on Apr 11, 2016
jreback added this to the 0.18.1 milestone on Apr 11, 2016
jreback modified the milestones: 0.18.1, 0.18.2 on Apr 26, 2016
jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 on Aug 21, 2016
@dsm054
Contributor

dsm054 commented Nov 13, 2018

Just to double-check, what's the expected result here?

As above,

0    False
1     True
2     True
3     True
4     True
dtype: bool

on the grounds that we're comparing the empty row to itself?
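The "empty row compared to itself" reading can be made concrete with a small plain-Python model. This is only a sketch of the keep="first" semantics, treating each row as a tuple of its column values; duplicated_rows is a hypothetical helper, not the pandas implementation.

```python
def duplicated_rows(rows):
    """Flag every row that repeats an earlier row (keep='first' semantics)."""
    seen = set()
    flags = []
    for row in rows:
        key = tuple(row)
        flags.append(key in seen)  # True only if an identical row came earlier
        seen.add(key)
    return flags


# Five rows with one column holding the same value.
one_column = duplicated_rows([(1,)] * 5)

# Five rows with no columns: every row is the empty tuple, and () == ().
no_columns = duplicated_rows([()] * 5)
```

In this model both calls give the same flags, because five empty tuples are as equal to one another as five copies of (1,). Whether pandas should adopt that reading is exactly the question under discussion.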

@mroeschke
Member

Looks to work on master. Could use a test.

In [382]: data_frame = pd.DataFrame({'a': [1]*5})
     ...: cols = ['a']

In [385]: data_frame[cols].duplicated()
Out[385]:
0    False
1     True
2     True
3     True
4     True
dtype: bool

In [389]: pd.__version__
Out[389]: '0.26.0.dev0+593.g9d45934af'

mroeschke added the good first issue and Needs Tests labels and removed the Bug, Difficulty Intermediate, and Missing-data labels on Oct 21, 2019
@grassknoted

Hi, @mroeschke just to make sure we're on the same page, for the following snippet:

data_frame = pd.DataFrame({'test_column': [1]*5})
cols = []
print(data_frame[cols].duplicated())

would the following output be correct?

Series([], dtype: bool)

@stevendavis

@grassknoted I just came across this by coincidence. You might also consider testing with an empty dataframe to verify there is no ValueError. It looks like the column subsetting is not a necessary step to reproduce the problem in older pandas versions.

$ python
Python 2.7.12 (default, Nov 4 2016, 18:11:59)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df.duplicated()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas-0.20.1-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 3242, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
ValueError: need more than 0 values to unpack

@sebov
Author

sebov commented Oct 21, 2019

Hi @grassknoted, according to the pandas docs, the duplicated method should "Return boolean Series denoting duplicate rows...", so it is natural to expect that it always returns a boolean Series whose size equals the number of rows in the data frame. Consider the following snippet:

data_frame = pd.DataFrame({'test_column': [1]*5})

cols = ['test_column']
print(data_frame[cols].shape[0])
print(data_frame[cols].duplicated().size)

print('***')

cols = []
print(data_frame[cols].shape[0])
print(data_frame[cols].duplicated().size)

the output is:

5
5
***
5
0

and this is a bit inconsistent.

So, for me the proper result for duplicated invoked on data_frame in both cases should be:

0    False
1     True
2     True
3     True
4     True
dtype: bool

@grassknoted

I tried reproducing the issue:

>>> import pandas as pd
>>> dataframe = pd.DataFrame({'test_column': [1]*5})
>>> cols = ['test_column']
>>> print(dataframe[cols].shape[0])
5
>>> print(dataframe[cols].duplicated().size)
5

>>> cols = []
>>> print(dataframe[cols].shape[0])
5
>>> print(dataframe[cols].duplicated().size)
0
>>> pd.__version__
'0.26.0.dev0+621.g6c898e6a5'

To me, this still looks like a bug, and needs fixing, not just more tests. Could you please confirm?

@sebov
Author

sebov commented Oct 23, 2019

Yes, from my point of view that should be considered a bug.

mroeschke added the Bug and duplicated labels and removed the Needs Tests and good first issue labels on Oct 23, 2019
mroeschke added the Missing-data label on Oct 23, 2019
@grassknoted

grassknoted commented Oct 24, 2019

Hi @sebov, looking at the trace, I can see that the discrepancy is caused by the following lines:

def duplicated(self, subset=None, keep="first"):

      # These lines:
      if self.empty:
          return Series(dtype=bool)

df[[<empty_columns>]] returns a DataFrame with no columns, so self.empty is True.
The first check in the duplicated() function tests for an empty DataFrame and returns an empty Series, regardless of how many rows the index has.
I think I found a possible way around this problem. Using the index property of the DataFrame, the function would be changed to return one value per row, with a default RangeIndex:

def duplicated(self, subset=None, keep="first"):

      if self.empty:
          return Series(data=[i for i in range(0, self.index.size)], dtype=bool)

With the above changes to the duplicated() function, the following behavior is observed, as expected:

>>> import pandas as pd
>>> dataframe = pd.DataFrame({'test_column':[1]*5})
>>> cols = ['test_column']
>>> print(dataframe[cols].shape[0])
5
>>> dataframe[cols].duplicated()
0    False
1     True
2     True
3     True
4     True
dtype: bool
>>> cols = []
>>> print(dataframe[cols].shape[0])
5
>>> dataframe[cols].duplicated()
0    False
1     True
2     True
3     True
4     True
dtype: bool

Please let me know if this is a possible fix, and how else I should test this, thanks!
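The False/True pattern in the output above appears to come from casting the integers 0, 1, 2, ... to bool rather than from comparing rows, and because the Series is built without an index argument it carries a default RangeIndex instead of the frame's own labels. The following plain-Python sketch shows the cast, next to a hypothetical helper (first_seen_mask, not part of pandas) that states the intent explicitly and keeps the labels:

```python
def cast_range_to_bool(n):
    """What a bool cast does to range(n): 0 -> False, any other int -> True."""
    return [bool(i) for i in range(n)]


def first_seen_mask(index_labels):
    """Pair each index label with False for the first row and True for the rest."""
    return [(label, position > 0) for position, label in enumerate(index_labels)]


cast = cast_range_to_bool(5)

# A frame indexed by 'x', 'y', 'z' keeps those labels in the result.
labelled = first_seen_mask(["x", "y", "z"])

# A frame with no rows yields no flags at all.
empty = first_seen_mask([])
```

The explicit version does not depend on 0 being falsy, which may make the intent easier to review. It still encodes the assumption that rows with no columns all count as duplicates of the first one.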

@mroeschke
Member

Feel free to submit a pull request for a full review @grassknoted!

@grassknoted

@mroeschke , I'm new to pandas, and was just trying to find my way around. Could you please point me in the right direction to look, to fix this issue?

@sebov
Author

sebov commented Oct 24, 2019

Hi @grassknoted, can you also check if your bugfix works for

cols = []
data_frame.duplicated(subset=cols)

I would expect the above to be more or less equivalent to

cols = []
data_frame[cols].duplicated()

What do you think?

@grassknoted

Thanks for the input @sebov !

So, my bugfix was failing for subset=[], but I think I can get around that as well by adding another check, as follows:

def duplicated(self, subset=None, keep="first"):

      # Return early if:
      #     - the DataFrame is empty, or
      #     - subset was passed explicitly (not None) but contains no columns
      if self.empty or (subset is not None and not subset):
          return Series(data=[i for i in range(0, self.index.size)], dtype=bool)

With this change in the code, the output is as follows:

>>> data_frame = pd.DataFrame({'column1':[1]*3})
>>> cols = ['column1']
>>> data_frame[cols].duplicated()
0    False
1     True
2     True
dtype: bool
>>> data_frame[cols].duplicated(subset=[])
0    False
1     True
2     True
dtype: bool
>>> data_frame.duplicated(subset=[])
0    False
1     True
2     True
dtype: bool
>>> cols = []
>>> data_frame[cols].duplicated()
0    False
1     True
2     True
dtype: bool

Thanks for pointing that out, please do let me know if there are any other tests I should run.
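One caveat about the `not subset` test: it works for a list, but if subset can also arrive as an array-like (an assumption about callers, not something established in this thread), truth-testing may fail, since a NumPy array with more than one element raises ValueError when used in a boolean context. A hedged sketch of a length-based check follows; is_explicit_empty_subset is a hypothetical helper name, not a pandas function.

```python
def is_explicit_empty_subset(subset):
    """True only when a subset was passed and it holds no column labels."""
    if subset is None:
        # None means "use every column", which is not an empty subset.
        return False
    if isinstance(subset, str):
        # A string is a single column label, never an empty collection.
        return False
    try:
        return len(subset) == 0
    except TypeError:
        # Scalar labels such as ints have no length and name one column.
        return False


checks = {
    "none": is_explicit_empty_subset(None),
    "empty_list": is_explicit_empty_subset([]),
    "empty_tuple": is_explicit_empty_subset(()),
    "one_label": is_explicit_empty_subset(["column1"]),
    "string_label": is_explicit_empty_subset("column1"),
    "int_label": is_explicit_empty_subset(0),
}
```

Using len() avoids relying on the truthiness of the container, at the cost of a few extra lines.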

@grassknoted

@sebov, any updates on this issue?

simonjayhawkins changed the title from "duplicated() invoked on a data frame with an empty subset of columns" to "BUG: duplicated() on a empty DataFrame or a DataFrame with an empty subset of columns with a non-empty index" on Jun 11, 2022
mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
ptth222 added a commit to ptth222/pandas that referenced this issue Mar 27, 2025
@ptth222

ptth222 commented Mar 27, 2025

I'm not sure why that output is the expected behavior for the last case. data_frame[cols] returns an empty DataFrame, and then you ask .duplicated() to find duplicates in that empty DataFrame. Returning an empty Series seems like what should happen. How could it possibly return the expected result? Why wouldn't you expect .sum() and .duplicated() to both return an empty Series?
