ENH: add errors='coerce' to DataFrame.astype #48781

Open
1 of 3 tasks
joooeey opened this issue Sep 26, 2022 · 6 comments
Assignees
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement

Comments

@joooeey
Contributor

joooeey commented Sep 26, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could quickly convert a DataFrame containing some invalid data to a numeric dtype, coercing the invalid entries. I thought pd.DataFrame.astype could do that, but it has no option to coerce invalid data to NaNs (or NaTs).

In my particular case I have a DataFrame of sensor readings with mostly NaNs (indicating no value received), many integers (those I care about), and some strings (indicating specific errors). I quickly tried to plot a histogram to get an overview of that data, but pd.DataFrame.hist requires numeric data, which takes a few lines of code to produce. This is exploratory code I write in my console, so it would be sweet if this could be done with a single method.

Toy Example

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 0.1, 1.1, 1.6],
    ["error", 0.2, 1.2, 1.7],
    [0.3, "", 1.3, 1.8],
    [0.4, 1.4, "code255", 1.9],
])

df.astype(float, errors="coerce")
# ValueError: Expected value of kwarg 'errors' to be one of ['raise', 'ignore'].
# Supplied value is 'coerce'

import matplotlib.pyplot as plt
plt.hist(df.values.flatten(), bins=[0, 1, 2])

Expected result:

In [30]: df
Out[30]: 
     0    1    2    3
0  NaN  0.1  1.1  1.6
1  NaN  0.2  1.2  1.7
2  0.3  NaN  1.3  1.8
3  0.4  1.4  NaN  1.9

[Image: histogram of the coerced values]

Feature Description

Two options:

  • Allow multidimensional input (e.g. DataFrames, NumPy arrays) as the arg of pd.to_numeric. In the case of mixed-type columns (e.g. integers and floats), we'd have to decide and document whether that would operate by column or cast the whole data structure to one dtype. I'd expect by column for DataFrames. Another issue that comes up is how to deal with multidimensional lists and tuples.

OR/AND

  • Add the option "coerce" to the errors kwarg in pd.DataFrame.astype (the current options are "raise" and "ignore"). We'd have to decide how to deal with incompatibilities between errors="coerce" and dtype, e.g. what to do if someone tries to coerce to string. I would expect an error.

To me it looks like the potential for confusing the user is a lot lower with the second option because it has fewer edge cases.
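To make the second option concrete, here is a minimal sketch of how the proposed behaviour could be emulated today. The helper name `astype_coerce` is hypothetical and not part of pandas; it assumes that only float and datetime targets are supported and that any other dtype (such as string) raises, as suggested above.

```python
import numpy as np
import pandas as pd


def astype_coerce(df: pd.DataFrame, dtype) -> pd.DataFrame:
    """Hypothetical stand-in for df.astype(dtype, errors="coerce")."""
    dtype = pd.api.types.pandas_dtype(dtype)
    if pd.api.types.is_float_dtype(dtype):
        # Unparseable entries become NaN, then cast to the requested float dtype.
        return df.apply(pd.to_numeric, errors="coerce").astype(dtype)
    if pd.api.types.is_datetime64_any_dtype(dtype):
        # Unparseable entries become NaT.
        return df.apply(pd.to_datetime, errors="coerce")
    # Assumption: dtypes without a natural missing value are rejected.
    raise TypeError(f"cannot coerce errors for dtype {dtype!r}")


df = pd.DataFrame([
    [np.nan, 0.1, 1.1, 1.6],
    ["error", 0.2, 1.2, 1.7],
    [0.3, "", 1.3, 1.8],
    [0.4, 1.4, "code255", 1.9],
])
as_float = astype_coerce(df, float)
```

Integer targets are left out on purpose, because a plain NumPy integer column cannot hold a missing value.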

Alternative Solutions

for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")
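The loop above can also be written as a single expression with `DataFrame.apply`, which forwards extra keyword arguments to the applied function. This is only a sketch of a workaround on the toy data, not the requested `astype` feature.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 0.1, 1.1, 1.6],
    ["error", 0.2, 1.2, 1.7],
    [0.3, "", 1.3, 1.8],
    [0.4, 1.4, "code255", 1.9],
])

# Apply pd.to_numeric column by column; invalid entries become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)
```

Unlike the loop, this returns a new DataFrame and leaves `df` unchanged.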

Additional Context

No response

@joooeey joooeey added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 26, 2022
@subbusainath

take

@MarcoGorelli
Member

Hi @joooeey

To expedite resolution, could you please include a reproducible example?

Like

  • I have a DataFrame like df = pd.DataFrame(...)
  • I would like to do ...
  • I'd like to see this as the output: ...

@MarcoGorelli MarcoGorelli added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 30, 2022
@joooeey
Contributor Author

joooeey commented Oct 4, 2022

@MarcoGorelli I added a toy example to the description.

@MarcoGorelli MarcoGorelli changed the title from "ENH:" to "ENH: add errors='coerce' to DataFrame.astype" Oct 4, 2022
@jbrockmendel
Member

Cc @jorisvandenbossche, I think you were looking into design decisions related to this.

@rhshadrach rhshadrach added Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Info Clarification about behavior needed to assess issue labels Feb 26, 2023
@oricou

oricou commented Mar 5, 2025

I would like this feature too, but why don't you call it "nan"?

df.astype(float, errors="nan")

It seems easier to understand that errors will become NaN.

BTW, we could choose what errors should become:

df.astype(int, errors=-1)

and then your feature becomes

df.astype(float, errors=np.nan)

@joooeey
Contributor Author

joooeey commented Mar 6, 2025

I would like this feature too, but why don't you call it "nan"?

df.astype(float, errors="nan")

It seems easier to understand that errors will become NaN.

This has a number of disadvantages:

  • errors="coerce" is already established in the signature of pd.to_numeric. It should be the same in df.astype for consistency.
  • errors="nan" is not a great choice where values become pd.NA or pd.NaT, depending on the dtype.
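As a small illustration of the second point, the missing-value marker that coercion produces already differs by dtype with the existing converters. The variable names below are only for illustration.

```python
import pandas as pd

raw = pd.Series(["1", "x"])

# Float target: the invalid entry becomes NaN.
as_float = pd.to_numeric(raw, errors="coerce")

# Nullable integer target: the same entry is represented as pd.NA.
as_nullable_int = as_float.astype("Int64")

# Datetime target: the invalid entry becomes pd.NaT.
as_datetime = pd.to_datetime(pd.Series(["2022-09-26", "x"]), errors="coerce")

print(as_float.iloc[1], as_nullable_int.iloc[1], as_datetime.iloc[1])
```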

BTW, we could choose what errors should become:

df.astype(int, errors=-1)

and then your feature becomes

df.astype(float, errors=np.nan)

This is a great idea, especially for dtypes int, category, str, etc. that don't have a NaN-value.
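A rough sketch of how a user-chosen replacement value might be emulated with current pandas follows. The helper `astype_with_fill` and its `fill` parameter are hypothetical, not an existing or agreed API, and the sketch assumes a numeric target dtype. It replaces only entries that failed to parse, so values that were already missing stay missing.

```python
import pandas as pd


def astype_with_fill(df: pd.DataFrame, dtype, fill) -> pd.DataFrame:
    """Hypothetical stand-in for df.astype(dtype, errors=fill)."""
    coerced = df.apply(pd.to_numeric, errors="coerce")
    # Entries that were present in the input but could not be parsed.
    failed = coerced.isna() & df.notna()
    return coerced.mask(failed, fill).astype(dtype)


readings = pd.DataFrame({"a": ["1", "x", "3"]})
result = astype_with_fill(readings, int, -1)
print(result)
```

If the input also contains genuine missing values, the final cast to a plain integer dtype raises, so such data would need a float or nullable dtype instead.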

6 participants