Skip to content

BUG: Don't cast categorical nan to int #28438

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Sep 18, 2019
Merged

BUG: Don't cast categorical nan to int #28438

merged 18 commits into from
Sep 18, 2019

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Sep 13, 2019

This raises an error when attempting to cast a Categorical or CategoricalIndex containing nans to an integer dtype. Also had to remove the casting within get_indexer_non_unique since this won't always be possible.

@dsaxton dsaxton changed the title Don't cast categorical nan to int BUG: Don't cast categorical nan to int Sep 13, 2019
Daniel Saxton added 2 commits September 13, 2019 20:49
@dsaxton
Copy link
Member Author

dsaxton commented Sep 14, 2019

FWIW the issue with test_get_indexer_non_unique is here https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/base.py#L4717. If target is Categorical then the values can't always be cast to the dtype of the categories.

@jbrockmendel
Copy link
Member

should the use_inf_as_na option matter here?

Categorical raises a ValueError at the moment, but CategoricalIndex
ends up raising a TypeError because this happens during the handling
of the ValueError
@dsaxton
Copy link
Member Author

dsaxton commented Sep 14, 2019

should the use_inf_as_na option matter here?

Wasn't familiar with that option, but it shouldn't affect numpy behavior I think?

@gfyoung gfyoung added Bug Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 15, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR should focus on just NaN handling.

I'd rather see Float64Index properly handle astyping inf values to int (raise), and then do self.categories.take(self.codes) with a check if any of the codes is negative, so that we raise.

dsaxton and others added 2 commits September 16, 2019 13:15
Co-Authored-By: Tom Augspurger <TomAugspurger@users.noreply.github.com>
@dsaxton
Copy link
Member Author

dsaxton commented Sep 16, 2019

I think this PR should focus on just NaN handling.

I'd rather see Float64Index properly handle astyping inf values to int (raise), and then do self.categories.take(self.codes) with a check if any of the codes is negative, so that we raise.

Ok, so for the purposes of this PR remove the references to np.inf in the code and tests?

@TomAugspurger
Copy link
Contributor

That would be my preference. Did my earlier comment make sense?

The NaN issue can be solved by checking Categorical.codes

In [7]: c = pd.Categorical([1, None, 2])

In [8]: c.codes < 0
Out[8]: array([False,  True, False])

The inf issue is deeper, and is present in Float64Index, so I think it should be split to its own PR.

In [9]: c = pd.Categorical([1, np.inf])

In [10]: c.categories.astype(int)
Out[10]: Int64Index([1, -9223372036854775808], dtype='int64')

@dsaxton
Copy link
Member Author

dsaxton commented Sep 16, 2019

That would be my preference. Did my earlier comment make sense?

The NaN issue can be solved by checking Categorical.codes

In [7]: c = pd.Categorical([1, None, 2])

In [8]: c.codes < 0
Out[8]: array([False,  True, False])

The inf issue is deeper, and is present in Float64Index, so I think it should be split to its own PR.

In [9]: c = pd.Categorical([1, np.inf])

In [10]: c.categories.astype(int)
Out[10]: Int64Index([1, -9223372036854775808], dtype='int64')

I believe so, you want to fix the inf casting behavior in one place (Float64Index) and then use the fixed astype method here? Regarding the nan issue, is there a reason to prefer self.codes < 0 over self.isna()?

@@ -520,6 +520,9 @@ def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike:
if dtype == self.dtype:
return self
return self._set_dtype(dtype)
if is_integer_dtype(dtype) and self.isna().any():
msg = "Cannot cast to int."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueError: cannot convert float NaN to integer is what we do now on
pd.Series([1,2,np.nan],dtype='Int64').astype('int') so would replicate this message

@@ -520,6 +520,9 @@ def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike:
if dtype == self.dtype:
return self
return self._set_dtype(dtype)
if is_integer_dtype(dtype) and self.isna().any():
msg = "Cannot convert float NaN to integer"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda pedantic, but we can have other NA values here.

In [18]: cat = pd.Categorical([pd.Timestamp('2000'), pd.NaT])

In this case, it's not a float that we're refusing to cast. So perhaps Cannot convert NA to integer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this seems more correct; do we want to think about consistency with Series as @jreback pointed out above? Incidentally I just noticed that Series seems to be misbehaving as well in this special case, so probably worth a separate issue or PR:

[ins] In [5]: pd.Series([pd.Timestamp("2000"), pd.NaT]).astype(int)             
Out[5]: 
0     946684800000000000
1   -9223372036854775808
dtype: int64

@dsaxton dsaxton mentioned this pull request Sep 18, 2019
3 tasks
@jreback jreback added this to the 1.0 milestone Sep 18, 2019
@jreback jreback merged commit 045880c into pandas-dev:master Sep 18, 2019
@jreback
Copy link
Contributor

jreback commented Sep 18, 2019

thanks @dsaxton

@dsaxton dsaxton deleted the cast-cat branch September 19, 2019 13:41
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Converting from categorical to int ignores NaNs
7 participants