BUG: Don't cast categorical nan to int #28438
FWIW the issue with |
Should the use_inf_as_na option matter here?
Categorical raises a ValueError at the moment, but CategoricalIndex ends up raising a TypeError, because the TypeError is raised while the ValueError is being handled.
Wasn't familiar with that option, but it shouldn't affect |
I think this PR should focus on just NaN handling.
I'd rather see Float64Index properly handle astyping inf values to int (raise), and then do self.categories.take(self.codes) with a check if any of the codes is negative, so that we raise.
Co-Authored-By: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Ok, so for the purposes of this PR remove the references to |
That would be my preference. Did my earlier comment make sense? The
In [7]: c = pd.Categorical([1, None, 2])
In [8]: c.codes < 0
Out[8]: array([False, True, False])
The inf issue is deeper, and is present in Float64Index, so I think it should be split to its own PR.
In [9]: c = pd.Categorical([1, np.inf])
In [10]: c.categories.astype(int)
Out[10]: Int64Index([1, -9223372036854775808], dtype='int64')
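As a minimal sketch of the codes-based check mentioned above: a Categorical stores missing values as the code -1 rather than as a category, so testing for negative codes detects them. The helper name `has_missing` is hypothetical and is not part of pandas or of this PR.

```python
import numpy as np
import pandas as pd


def has_missing(cat: pd.Categorical) -> bool:
    """Hypothetical helper: True if any element of `cat` is missing."""
    # Missing values never appear in `cat.categories`; they are encoded
    # as the sentinel code -1 in `cat.codes`.
    return bool(np.any(cat.codes < 0))


with_na = pd.Categorical([1, None, 2])
without_na = pd.Categorical([1, 2])

print(with_na.codes < 0)        # same check as in the comment above
print(has_missing(with_na))     # True
print(has_missing(without_na))  # False
```

This is equivalent in effect to `cat.isna().any()`, which is what the diff in this PR uses.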
I believe so, you want to fix the |
pandas/core/arrays/categorical.py
Outdated
@@ -520,6 +520,9 @@ def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike:
    if dtype == self.dtype:
        return self
    return self._set_dtype(dtype)
if is_integer_dtype(dtype) and self.isna().any():
    msg = "Cannot cast to int."
"ValueError: cannot convert float NaN to integer" is what we raise now on
pd.Series([1,2,np.nan],dtype='Int64').astype('int')
so I would replicate this message.
@@ -520,6 +520,9 @@ def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike:
    if dtype == self.dtype:
        return self
    return self._set_dtype(dtype)
if is_integer_dtype(dtype) and self.isna().any():
    msg = "Cannot convert float NaN to integer"
Kinda pedantic, but we can have other NA values here.
In [18]: cat = pd.Categorical([pd.Timestamp('2000'), pd.NaT])
In this case, it's not a float that we're refusing to cast. So perhaps "Cannot convert NA to integer".
Agree this seems more correct; do we want to think about consistency with Series as @jreback pointed out above? Incidentally, I just noticed that Series seems to be misbehaving as well in this special case, so it's probably worth a separate issue or PR:
In [5]: pd.Series([pd.Timestamp("2000"), pd.NaT]).astype(int)
Out[5]:
0     946684800000000000
1   -9223372036854775808
dtype: int64
Thanks @dsaxton
black pandas
This raises an error when attempting to cast a Categorical or CategoricalIndex containing NaNs to an integer dtype. Also had to remove the casting within get_indexer_non_unique since this won't always be possible.
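The behaviour described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the code merged in this PR: `cast_categorical` is a hypothetical name, and the error message follows the "Cannot convert NA to integer" wording suggested in review rather than the final text.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype


def cast_categorical(cat: pd.Categorical, dtype) -> np.ndarray:
    """Hypothetical helper: cast `cat` to `dtype`, refusing NA -> integer."""
    # Integers have no representation for missing values, so raise instead
    # of silently producing a sentinel such as -9223372036854775808.
    if is_integer_dtype(dtype) and cat.isna().any():
        raise ValueError("Cannot convert NA to integer")
    return np.asarray(cat).astype(dtype)


# No missing values: the cast goes through.
print(cast_categorical(pd.Categorical([1, 2]), int))

# NaN and NaT are both treated as NA, so both casts are refused.
for values in ([1, None, 2], [pd.Timestamp("2000"), pd.NaT]):
    try:
        cast_categorical(pd.Categorical(values), int)
    except ValueError as err:
        print(f"ValueError: {err}")
```

Checking `isna()` rather than looking specifically for float NaN is what lets the same guard cover the `pd.NaT` case raised in the review.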