BUG: Don't cast categorical nan to int #28438

dsaxton · 2019-09-13T21:42:13Z

closes #28406 (Converting from categorical to int ignores NaNs)
passes black pandas
tests added / passed
whatsnew entry

This raises an error when attempting to cast a Categorical or CategoricalIndex containing NaNs to an integer dtype. I also had to remove the cast within get_indexer_non_unique, since it won't always be possible.
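
For illustration, a minimal sketch of the behavior described above, assuming a pandas version that includes this change. The wording of the error message was still being discussed in review, so the sketch only checks the exception type.

```python
import pandas as pd

cat = pd.Categorical([1, None, 2])

# Casting a Categorical that contains a missing value to int should raise.
try:
    cat.astype(int)
    raised = False
except ValueError:
    raised = True

# A Categorical without missing values still casts cleanly.
clean = pd.Categorical([1, 2, 1]).astype(int)
```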

dsaxton · 2019-09-14T01:08:14Z

FWIW the issue with test_get_indexer_non_unique is here https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/base.py#L4717. If target is Categorical then the values can't always be cast to the dtype of the categories.
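
A small sketch of why that cast can fail, assuming integer categories and one missing value: the categories stay integer, but the materialized values need a float dtype to hold the NaN.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, None, 2])

# dtype of the categories themselves
categories_kind = cat.categories.dtype.kind

# dtype of the values once the missing entry is materialized as NaN
values = np.asarray(cat)
values_kind = values.dtype.kind
```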

jbrockmendel · 2019-09-14T01:13:44Z

should the use_inf_as_na option matter here?

dsaxton · 2019-09-14T02:06:40Z

Wasn't familiar with that option, but it shouldn't affect numpy behavior I think?

TomAugspurger

I think this PR should focus on just NaN handling.

I'd rather see Float64Index properly handle astyping inf values to int (raise), and then do self.categories.take(self.codes) with a check if any of the codes is negative, so that we raise.
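
A rough sketch of the second half of this suggestion (the negative-code check followed by take). The helper name `categorical_to_int` and the error message are hypothetical, not pandas API, and the sketch leaves inf handling to the index as proposed.

```python
import numpy as np
import pandas as pd


def categorical_to_int(cat: pd.Categorical) -> np.ndarray:
    """Hypothetical helper, not part of pandas."""
    # Missing values are stored as negative codes, so any negative
    # code means there is an NA that cannot become an integer.
    if (cat.codes < 0).any():
        raise ValueError("Cannot convert NA to integer")
    # With no negative codes, every code points at a real category.
    return cat.categories.take(cat.codes).to_numpy().astype("int64")
```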

dsaxton · 2019-09-16T18:18:15Z

Ok, so for the purposes of this PR remove the references to np.inf in the code and tests?

TomAugspurger · 2019-09-16T21:36:08Z

That would be my preference. Did my earlier comment make sense?

The NaN issue can be solved by checking Categorical.codes

In [7]: c = pd.Categorical([1, None, 2])

In [8]: c.codes < 0
Out[8]: array([False,  True, False])

The inf issue is deeper, and is present in Float64Index, so I think it should be split to its own PR.

In [9]: c = pd.Categorical([1, np.inf])

In [10]: c.categories.astype(int)
Out[10]: Int64Index([1, -9223372036854775808], dtype='int64')
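
One way such a guard could look, as a sketch only: the helper `float_to_int_checked` and its message are assumptions for illustration, not the fix that was eventually made in a separate PR.

```python
import numpy as np


def float_to_int_checked(values) -> np.ndarray:
    """Hypothetical helper: refuse to cast NaN or inf to int64."""
    arr = np.asarray(values, dtype="float64")
    # np.isfinite is False for NaN, inf and -inf alike.
    if not np.isfinite(arr).all():
        raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
    return arr.astype("int64")
```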

dsaxton · 2019-09-16T23:17:57Z

I believe so, you want to fix the inf casting behavior in one place (Float64Index) and then use the fixed astype method here? Regarding the nan issue, is there a reason to prefer self.codes < 0 over self.isna()?
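
For reference, a quick check that the two masks agree on the example from the comment above (a sketch, assuming missing values are always stored as negative codes):

```python
import numpy as np
import pandas as pd

c = pd.Categorical([1, None, 2])

by_codes = c.codes < 0   # mask from the sentinel code
by_isna = c.isna()       # mask from the public method

same = np.array_equal(by_codes, by_isna)
```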

jreback · 2019-09-17T12:43:01Z

pandas/core/arrays/categorical.py

@@ -520,6 +520,9 @@ def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike:
            if dtype == self.dtype:
                return self
            return self._set_dtype(dtype)
+        if is_integer_dtype(dtype) and self.isna().any():
+            msg = "Cannot cast to int."


"ValueError: cannot convert float NaN to integer" is what we raise now on pd.Series([1, 2, np.nan], dtype='Int64').astype('int'), so I would replicate this message.

TomAugspurger · 2019-09-17T17:01:29Z

pandas/core/arrays/categorical.py

@@ -520,6 +520,9 @@ def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike:
            if dtype == self.dtype:
                return self
            return self._set_dtype(dtype)
+        if is_integer_dtype(dtype) and self.isna().any():
+            msg = "Cannot convert float NaN to integer"


Kinda pedantic, but we can have other NA values here.

In [18]: cat = pd.Categorical([pd.Timestamp('2000'), pd.NaT])

In this case, it's not a float that we're refusing to cast. So perhaps Cannot convert NA to integer.
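
A short sketch of that case: the missing entry is a NaT in datetime categories, yet it is stored the same way as any other NA, so a check on the mask or the codes covers it too.

```python
import pandas as pd

cat = pd.Categorical([pd.Timestamp("2000"), pd.NaT])

mask = cat.isna()                        # which entries are missing
codes = cat.codes                        # integer codes behind the values
categories_dtype = cat.categories.dtype  # datetime categories, not float
```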

dsaxton · 2019-09-17

Agreed, this seems more correct. Do we want to think about consistency with Series, as @jreback pointed out above? Incidentally, I just noticed that Series seems to be misbehaving as well in this special case, so it's probably worth a separate issue or PR:

In [5]: pd.Series([pd.Timestamp("2000"), pd.NaT]).astype(int)
Out[5]:
0     946684800000000000
1   -9223372036854775808
dtype: int64

jreback · 2019-09-18T12:35:09Z

thanks @dsaxton

Don't cast categorical nan to int

9dd2dbe

dsaxton changed the title from "Don't cast categorical nan to int" to "BUG: Don't cast categorical nan to int" Sep 13, 2019

jbrockmendel reviewed Sep 13, 2019

pandas/tests/extension/test_categorical.py

Daniel Saxton added 2 commits September 13, 2019 19:03

Parametrize test

952114f

Add CategoricalIndex test

ffed8a0

mroeschke reviewed Sep 14, 2019

pandas/core/arrays/categorical.py

Daniel Saxton added 2 commits September 13, 2019 20:49

Use isfinite

504be90

Fix

1290cb2

Check TypeError as well for now

ab21763

Categorical raises a ValueError at the moment, but CategoricalIndex ends up raising a TypeError because this happens during the handling of the ValueError

gfyoung added the Bug, Categorical (Categorical Data Type), and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Sep 15, 2019

gfyoung reviewed Sep 15, 2019

pandas/tests/extension/test_categorical.py

gfyoung reviewed Sep 15, 2019

pandas/tests/extension/test_categorical.py

Daniel Saxton added 7 commits September 15, 2019 08:59

Check error message

858ff06

Fix doc typo

88874dc

Revert "Use isfinite"

eb76e1f

This reverts commit 504be90.

Extract array directly

2a8186a

Keep _maybe_promote

afacbe3

Add note

dbff36f

Fix typo

075ba33

WillAyd reviewed Sep 16, 2019

pandas/core/indexes/base.py

TomAugspurger reviewed Sep 16, 2019

pandas/core/indexes/base.py

doc/source/whatsnew/v1.0.0.rst

pandas/core/arrays/categorical.py

dsaxton and others added 2 commits September 16, 2019 13:15

Update doc/source/whatsnew/v1.0.0.rst

5aeb8b6

Co-Authored-By: Tom Augspurger <TomAugspurger@users.noreply.github.com>

Use np.asarray

af3ff15

Only check NaN

e9cc1fa

dsaxton mentioned this pull request Sep 17, 2019

BUG: Raise when casting infinity to int #28475

Merged

3 tasks

jreback requested changes Sep 17, 2019

Change error message

ab7b3ce

TomAugspurger reviewed Sep 17, 2019

pandas/tests/extension/test_categorical.py

Add to test cases

754a3ed

dsaxton mentioned this pull request Sep 18, 2019

BUG: Raise when casting NaT to int #28492

Merged

3 tasks

jreback added this to the 1.0 milestone Sep 18, 2019

jreback approved these changes Sep 18, 2019

jreback merged commit 045880c into pandas-dev:master Sep 18, 2019

dsaxton deleted the cast-cat branch September 19, 2019 13:41

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

BUG: Don't cast categorical nan to int (pandas-dev#28438)

dff0f45

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

BUG: Don't cast categorical nan to int (pandas-dev#28438)

7e26a93
