Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Replacing behavior currently provided by pandas.to_numeric using errors="ignore" #59221

Open
1 of 3 tasks
ahobeost opened this issue Jul 10, 2024 · 7 comments
Open
1 of 3 tasks
Labels
Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@ahobeost
Copy link

ahobeost commented Jul 10, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

As a User, I am reading in CSV files with mixed types in certain columns.
Specifically, the column can have a value, or a file path to a list of values.
With all current quoting flags, the df.read_csv turns this into an object column where the floats or ints are quoted as well.

Thus, turning all nrs into numerics while keeping all file paths as strings is easily done as follows:
df[column_name] = df[column_name].apply(_pd.to_numeric, errors="ignore")

Thus, as the solution above is being sunsetted, I wish pandas could provide a simple solution to get object columns with mixed dtypes which treat nrs as numerics from csv files.

Feature Description

usage:
pd.to_numeric(preserve_text=True)

implementation

Alternative Solutions

usage:
csv_read(to_numeric_in_mixed_columns=True, preserve_text_in_mixed_columns=True)

Additional Context

#54467
#43280
pypest/pyemu#485

The proposed workaround in the pypest issue is not satisfactory as this should be a common occurrence with a common solution.

@ahobeost ahobeost added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 10, 2024
@mepearson
Copy link

upvote for this issue

@snitish snitish added Needs Info Clarification about behavior needed to assess issue Needs Triage Issue that has not been reviewed by a pandas team member and removed Needs Triage Issue that has not been reviewed by a pandas team member Needs Info Clarification about behavior needed to assess issue labels Mar 26, 2025
@snitish
Copy link
Member

snitish commented Mar 26, 2025

@mroeschke this looks like a legitimate use case of the errors="ignore" feature of to_numeric that was deprecated in v2.2. Any ideas on how we can add this functionality back, either within to_numeric or as a new function? So far the only acceptable workaround seems to be this, but it's not particularly clean.

@snitish snitish added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 26, 2025
@ahobeost
Copy link
Author

Thanks @snitish.
Two suggestions:

  • The User could provide allowed types to to_numeric:
    _pd.to_numeric(..., allowed_types=[float, datetime])

  • Another function could be built which inherently deals with mixed columns:

    _pd.to_mixed(..., expected_types=[int, float, datetime]
    _pd.interpret_as_mixed_column()
    _pd.allow_expected_types()
    

@ahobeost
Copy link
Author

Based on some initial tests with df.convert_dtypes, the two suggested options will likely not work.

Thus I add a third suggestion:
_pd.to_numeric(..., keep_original_strings=True)

If an error occurs, it keeps the original string for that entry.

@mepearson
Copy link

mepearson commented Mar 26, 2025

Based on some initial tests with df.convert_dtypes, the two suggested options will likely not work.

Thus I add a third suggestion: _pd.to_numeric(..., keep_original_strings=True)

If an error occurs, it keeps the original string for that entry.

This would work for us and is the behavior we are using in our use case.

For reference, the use case is a .csv that serves as a controlled vocabulary for defined terms in order to map between fields-values-UI display terms. It needs to be centralized in a single file as the actual values/UI meanings are reviewed/maintained by SME vs. developers

The code in question creates a dict that splits this apart, and could potentially be handled with an if/then since within a given field things are either all numeric or all not. But the 'if this is numeric, make it numeric, and if not leave it as a string' is the intended behavior as currently written.

Current function in use:

def get_terms_dictionary(display_terms, api_field, api_value, display_col):
    '''from a dataframe with the table display information, create a dictionary by field to match the database
    value to a value for use in the UI '''
    try:
        display_terms_list = display_terms[api_field].unique() # List of fields with matching display terms

        # Create a dictionary using the field as the key, and the dataframe to map database values to display text as the value
        display_terms_dict = {}
        for i in display_terms_list:
            term_df = display_terms[display_terms.api_field == i]
            term_df = term_df[[api_value,display_col]]
            term_df = term_df.rename(columns={api_value: i, display_col: i + '_display'})
            term_df = term_df.apply(pd.to_numeric, errors='ignore')
            display_terms_dict[i] = term_df
        return display_terms_dict

    except Exception as e:
        traceback.print_exc()
        return None```

@ahobeost
Copy link
Author

Based on the comment by @mepearson, I add our example here.
In optihood, the User provides a lot of information related to energy network for neighborhood of buildings.
As part of this information, some columns are either constant values, or string paths to a csv file.
Thus, the number should be changed to an int or float, and the string should remain.

Image
https://github.com/SPF-OST/optihood/blob/main/data/CSVs/basic_example_CSVs/commodity_sources.csv

To complicate things, the cells can also be empty.

Image
https://github.com/SPF-OST/optihood/blob/main/data/CSVs/basic_example_CSVs/buses.csv

It would be nice if these empty cells remain interpreted as NaN.

@mroeschke
Copy link
Member

Thanks for the suggestion, but I would be -1 on modifying to_numeric in this way.

Although it's not "clean" just to do

def parse_numbers(x):
    try:
        return pd.to_numeric(x)
    except Exception:
        return x

df[col].apply(parse_numbers)

This behavior was intentionally deprecated in pandas 2.x to avoid potentially returning object columns which don't have great support and poor performance characteristics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants