ENH: add errors='coerce' to DataFrame.astype #48781
Comments
take |
Hi @joooeey To expedite resolution, could you please include a reproducible example? Like
|
@MarcoGorelli I added a toy example to the description. |
Cc @jorisvandenbossche, I think you were looking into design decisions related to this. |
I would like this feature too, but why not call it "nan"? With df.astype(float, errors="nan") it seems easier to understand that errors will become NaN. BTW, we could let the user choose what errors should become: df.astype(int, errors=-1), and then your feature becomes df.astype(float, errors=np.nan). |
This has a number of disadvantages:
This is a great idea, especially for dtypes int, category, str, etc. that don't have a NaN-value. |
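A rough sketch of the fill-value idea using what pandas offers today. The helper name astype_with_fill and the sample data are hypothetical, invented for illustration; this is not an existing or proposed pandas API. Note one limitation of this approach: values that were already missing are replaced by the fill value too, since they cannot be told apart from coerced errors.

```python
import pandas as pd


def astype_with_fill(df, dtype, fill):
    """Hypothetical helper: cast to `dtype`, using `fill` where parsing fails."""
    # Coerce each column to numeric; unparseable and missing values become NaN.
    coerced = df.apply(pd.to_numeric, errors="coerce")
    # Replace every NaN with the fill value, then cast to the requested dtype.
    return coerced.fillna(fill).astype(dtype)


# Invented sample data mixing integers, a missing value, and error strings.
df = pd.DataFrame({"a": [1, "oops", 3], "b": [4, None, "x"]})
result = astype_with_fill(df, int, -1)
print(result)
```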
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
I wish I could quickly convert a DataFrame with some invalid data to a numeric type, coercing the invalid values. I thought pd.DataFrame.astype could do that, but it doesn't have an option to coerce invalid data to NaNs (or NaTs).
In my particular case I have a DataFrame of sensor readings with mostly NaNs (indicating no value received), many integers (those I care about), and some strings (indicating specific errors). I quickly tried to get a histogram for an overview of that data, but pd.DataFrame.hist requires numeric data, which takes a few lines of code to produce. This is exploratory code I write in my console, so it would be sweet if this could be done with a single method.
Toy Example
Expected result:
Feature Description
Two options:
1. Accept DataFrames for the arg of pd.to_numeric. In case of mixed type columns (e.g. integers and floats), we'd have to decide and document whether that would operate by column or cast the whole data structure to one dtype. I'd expect by column for DataFrames. Another issue that comes up is how to deal with multidimensional lists and tuples.
OR/AND
2. Add "coerce" to the errors kwarg in pd.DataFrame.astype (the current options are "raise" and "ignore"). We'd have to decide how to deal with incompatibilities between errors="coerce" and dtype, e.g. what to do if someone tries to coerce to string. I would expect an error.
To me it looks like the potential for confusing the user is a lot lower with the second option because it has fewer edge cases.
Alternative Solutions
Additional Context
No response