BUG: value_counts and nunique behave differently for NaN and None with dropna=False #42688

Closed
2 of 3 tasks
knoam opened this issue Jul 23, 2021 · 5 comments · Fixed by #42743
Labels
Algos, Bug, Missing-data
@knoam

knoam commented Jul 23, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import numpy as np
import pandas as pd

s = pd.Series([None, np.nan, 'Y'])
s.nunique(dropna=False)       # returns 3
s.value_counts(dropna=False)  # has only 2 rows

Problem description

nunique returns 3, but value_counts has only 2 rows. This may be related to issue #37566.

Expected Output

I would expect these to be consistent.

Output of pd.show_versions()

pandas : 1.3.0
numpy : 1.19.5
pytz : 2019.3
dateutil : 2.8.1
pip : 21.1.2
setuptools : 56.0.0
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.24.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.3.23
tables : None
tabulate : None
xarray : 0.17.0
xlrd : 2.0.1
xlwt : None
numba : 0.52.0

@knoam added the Bug and Needs Triage labels Jul 23, 2021
@Varun270
Contributor

@knoam I am new to open source. Can you please explain what this issue is about?

@phofl
Member

phofl commented Jul 25, 2021

He expects both statements to return the same result. I think that should be 3, since the nunique call seems correct.

@phofl added the Missing-data label and removed the Needs Triage label Jul 25, 2021
@realead
Contributor

realead commented Jul 26, 2021

It is value_counts that behaves inconsistently. nunique behaves consistently with isin, unique, mode and so on, e.g.:

# np.object was a deprecated alias and has been removed from NumPy; use the builtin object instead
pd.Series([None, np.nan], dtype=object).unique()   # => [None, np.nan]
pd.Series([None], dtype=object).isin([np.nan])     # => [False]

The decision not to mangle np.nan and None was made in #22296. Maybe we need to discuss this again, but mangling them in a consistent way is much harder than leaving things as they are (and as a workaround one can always preprocess the data, replacing all missing values with e.g. pd.NA).
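
The workaround mentioned above could look roughly like the following. This is only a sketch: it assumes an object-dtype Series, and the names `normalized`, `n_unique` and `n_rows` are made up for the illustration. Once every missing value is mapped to the single marker pd.NA, both methods should see the same number of distinct entries.

```python
import numpy as np
import pandas as pd

# Explicit object dtype so that None and np.nan are stored exactly as given.
s = pd.Series([None, np.nan, "Y"], dtype=object)

# Replace every missing value (None, np.nan, ...) with one marker, pd.NA.
normalized = s.map(lambda v: pd.NA if pd.isna(v) else v)

# After normalization there is one missing-value entry plus "Y".
n_unique = normalized.nunique(dropna=False)
n_rows = len(normalized.value_counts(dropna=False))
print(n_unique, n_rows)
```

The element-wise `map` is slower than a vectorized replacement, but it makes the intent explicit and does not depend on how a particular pandas version fills object columns.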

@realead
Contributor

realead commented Jul 26, 2021

This is the "special code" in value_count that mangles all NaN-like values; it was probably overlooked in #22296:

is_null = checknull(val)
if not is_null or not dropna:
    # all nas become the same representative:
    if is_null:
        val = navalue

To get consistent behavior it just needs to be deleted.
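
To make the effect of that branch concrete, here is a pure-Python sketch of the counting loop. It is not the pandas implementation: `is_null` is a rough stand-in for `checknull`, and `count_values` with its `unify_nulls` flag is a hypothetical helper that only exists to switch the "same representative" step on and off.

```python
import math
from collections import Counter


def is_null(val):
    """Rough stand-in for checknull: None or a float NaN."""
    return val is None or (isinstance(val, float) and math.isnan(val))


def count_values(values, dropna=False, unify_nulls=True):
    """Count values, optionally folding all nulls into one representative."""
    counts = Counter()
    na_value = float("nan")  # the single representative for all nulls
    for val in values:
        null = is_null(val)
        if null and dropna:
            continue  # nulls are skipped entirely
        if null and unify_nulls:
            val = na_value  # all nulls become the same key
        counts[val] += 1
    return counts


# Caveat: a Python dict matches NaN keys by identity, so one nan object is reused here.
nan = float("nan")
data = [None, nan, "Y"]
print(len(count_values(data, unify_nulls=True)))   # 2 keys: nan, "Y"
print(len(count_values(data, unify_nulls=False)))  # 3 keys: None, nan, "Y"
```

With the unification step, None and NaN land in one bucket, which is the 2-row result reported in this issue; without it, the count matches the 3 distinct values that nunique reports.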

@aneesh98
Contributor

If a change to value_counts is required to make it consistent with nunique, can I take up this issue and make a PR based on the code changes suggested by @realead? I am new to open source, and I would like to work on this issue.

@jreback added this to the 1.4 milestone Jul 28, 2021
@mroeschke added the Algos label Aug 21, 2021