BUG: pd.to_numeric has an inconsistent behavior for datetime objects #43280

Open
3 tasks done
hec10r opened this issue Aug 29, 2021 · 16 comments
Assignees
Labels
Bug, Datetime (Datetime data dtype), Dtype Conversions (Unexpected or buggy dtype conversions)

Comments

@hec10r

hec10r commented Aug 29, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

>>> import pandas as pd
>>> from datetime import datetime
>>> from functools import partial
>>> pd.to_numeric(datetime(2021, 8, 22), errors="coerce")
nan
>>> pd.to_numeric(pd.Series(datetime(2021, 8, 22)), errors="coerce")
0    1629590400000000000
dtype: int64
>>> pd.Series([datetime(2021, 8, 22)]).apply(partial(pd.to_numeric), errors="coerce")
0   NaN
dtype: float64
>>>
>>> pd.to_numeric(pd.NaT, errors="coerce")
nan
>>> pd.to_numeric(pd.Series(pd.NaT), errors="coerce")
0   -9223372036854775808
dtype: int64
>>> pd.Series([pd.NaT]).apply(partial(pd.to_numeric), errors="coerce")
0   NaN
dtype: float64

Problem description

When pd.to_numeric converts a pd.Series with dtype datetime64[ns], it returns different values than it does when the same Series is converted value by value.

Expected Output

Converting a pd.Series as a whole should give the same result as converting it value by value.
I am not sure about what the correct output should be, but IMO the output should be consistent in these two scenarios.

What I suggest:

  • For non-null values, always return the same value. Maybe the integer?
  • For pd.NaT, always return np.NaN
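
For readers wondering where the two integers in the report come from, the short sketch below reproduces them without calling pd.to_numeric at all. It assumes only that Timestamp.value and NaT.value expose the underlying nanosecond count, which is how pandas stores datetime64 data; the variable names are illustrative.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Nanoseconds since the Unix epoch for the timestamp used in the report.
ts_ns = pd.Timestamp(datetime(2021, 8, 22)).value
print(ts_ns)  # 1629590400000000000

# NaT is stored as the smallest int64, which is the value that leaks out
# of the Series code path in the report.
nat_sentinel = pd.NaT.value
print(nat_sentinel == np.iinfo(np.int64).min)  # True
```

So the Series path appears to expose the raw int64 storage, including the NaT sentinel, while the scalar path treats the datetime as non-numeric.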

Output of pd.show_versions()

I am using the latest version of master as of today

INSTALLED VERSIONS

commit : e39ea30
python : 3.8.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.4.0.dev0+517.gc3761e24d8
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200714
Cython : 0.29.21
pytest : 5.4.3
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.9
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.50.1

hec10r added the Bug and Needs Triage labels Aug 29, 2021
@DAKSHA2001

Assign me this issue. I will solve it. I have to make an open source contribution as part of Microsoft Research Intern Role so assign this to me

@ShreyasPatel031

take

@Navaneethan2503

PR #43289 will close this issue. Thanks for reporting it, @hec10r.
closes #43280

@ShreyasPatel031 ShreyasPatel031 removed their assignment Aug 30, 2021
@hec10r
Author

hec10r commented Aug 30, 2021

take

@hec10r
Author

hec10r commented Aug 30, 2021

After some investigation, it looks like this behavior is explained by: numpy/numpy#19782, so not sure if this is a pandas issue or needs to be fixed in numpy. Before creating a PR here I would like to have the input from the maintainers.

@mroeschke
Member

The key point to note is that datetime objects in pandas get converted to datetime64[ns] (this has been the convention for a while) and not explicitly an object unless dtype=object is specified. So that being said:

Incorrect, should be 1629590400000000000

>>> pd.to_numeric(datetime(2021, 8, 22), errors="coerce")
nan

Correct

>>> pd.to_numeric(pd.Series(datetime(2021, 8, 22)), errors="coerce")
0    1629590400000000000
dtype: int64

Incorrect, should be 1629590400000000000

>>> pd.Series([datetime(2021, 8, 22)]).apply(partial(pd.to_numeric), errors="coerce")
0   NaN
dtype: float64

Correct

>>> pd.to_numeric(pd.NaT, errors="coerce")
nan

Incorrect, probably related to #16674

>>> pd.to_numeric(pd.Series(pd.NaT), errors="coerce")
0   -9223372036854775808
dtype: int64

Correct

>>> pd.Series([pd.NaT]).apply(partial(pd.to_numeric), errors="coerce")
0   NaN
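
The dtype inference mentioned at the top of this comment can be checked directly. The sketch below is illustrative only: it prints the dtypes pandas picks for the three kinds of Series that appear in this thread. The exact datetime64 unit may differ between pandas versions, so the comments only state what is certain.

```python
from datetime import datetime

import pandas as pd

# A Series built only from datetime objects is inferred as a datetime64 dtype.
inferred = pd.Series([datetime(2021, 8, 22)])
print(inferred.dtype)

# Requesting dtype=object keeps the raw datetime objects.
as_object = pd.Series([datetime(2021, 8, 22)], dtype=object)
print(as_object.dtype)  # object

# Mixing strings, numbers and datetimes also falls back to object.
mixed = pd.Series(["apple", 1, datetime(2021, 8, 22)])
print(mixed.dtype)  # object
```

This is why the whole-Series call and the element-wise call can take different code paths: the first sees a datetime64 array, the second sees individual datetime scalars.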

@hec10r
Author

hec10r commented Aug 31, 2021

Hi Matthew, thanks for answering.

That's the other approach that I considered, but since there wasn't any documentation of the pd.to_numeric behavior for date-like objects and I didn't find it very intuitive, I thought that changing the whole behavior to something more intuitive would be good.

The use case I am struggling with is this one:

>>> import pandas as pd
>>> from datetime import datetime
>>> pd.to_numeric(pd.Series([datetime(2021,8,22)]), errors="raise") # Same for errors="ignore"/"coerce"
0    1629590400000000000
dtype: int64
>>> pd.to_numeric(pd.Series(["apple", 1, datetime(2021,8,22)]), errors="coerce")
0    NaN
1    1.0
2    NaN
dtype: float64
>>> pd.to_numeric(pd.Series(["apple", 1, datetime(2021,8,22)]), errors="ignore")
0                  apple
1                      1
2    2021-08-22 00:00:00
dtype: object

Is this desired/expected? Should we return 1629590400000000000 in the three cases?

@mroeschke
Member

Sorry, yes the documentation could use improvement to address how datetime-like objects (e.g. np.datetime64, datetime.datetime, etc) are treated.

I would expect 1629590400000000000 in all 3 cases, though this case is a bit tricky: the entire Series has dtype=object, but certain elements in that Series can be converted to numbers, which aligns with what you mentioned in your OP:

Converting a pd.Series as a whole should be the same as converting it value by value.

@hec10r
Author

hec10r commented Aug 31, 2021

IMO, the best approach would be to have a function pd._scalar_to_numeric and then call it inside pd.to_numeric, something like:

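from functools import partial

import numpy as np
import pandas as pd
from pandas.api.types import is_scalar

def to_numeric(arg, errors="raise", downcast=None):
    # pd._scalar_to_numeric, is_series_or_index and is_list_tuple_or_np_array
    # are proposed helpers; they do not exist in pandas.
    if is_scalar(arg):
        return pd._scalar_to_numeric(arg, errors=errors, downcast=downcast)
    elif is_series_or_index(arg):
        return arg.apply(pd._scalar_to_numeric, errors=errors, downcast=downcast)
    elif is_list_tuple_or_np_array(arg):
        # Maybe keeping the dtype as well for `np.array`?
        convert = partial(pd._scalar_to_numeric, errors=errors, downcast=downcast)
        return np.array([convert(value) for value in arg])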

What do you think?

@mroeschke
Member

Probably the best approach would be to modify this line here to handle datetime-like scalars (using one of the existing functions in pandas.core.dtypes.common):

elif is_scalar(arg):
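
As a rough illustration of what handling datetime-like scalars could mean, here is a standalone sketch. The function scalar_to_numeric is hypothetical and not part of pandas, and the choice of returning NaN for NaT follows the suggestion in the original report rather than any decided behavior. It uses plain isinstance checks instead of the helpers in pandas.core.dtypes.common.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def scalar_to_numeric(value, errors="raise"):
    """Hypothetical helper (not a pandas API) for datetime-like scalars."""
    if value is pd.NaT or isinstance(value, (datetime, np.datetime64)):
        # Missing datetimes become NaN instead of the int64 sentinel.
        if pd.isna(value):
            return np.nan
        # Timestamp.value is the number of nanoseconds since the Unix epoch.
        return pd.Timestamp(value).value
    # Everything else keeps the existing scalar behavior.
    return pd.to_numeric(value, errors=errors)


print(scalar_to_numeric(datetime(2021, 8, 22)))  # 1629590400000000000
print(scalar_to_numeric(pd.NaT))                 # nan
print(scalar_to_numeric("1.5"))                  # 1.5
```

Checking for NaT before converting matters because pd.Timestamp(pd.NaT) is still NaT, whose stored value is the int64 minimum.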

@hec10r
Author

hec10r commented Sep 1, 2021

By only doing that, the problem will persist for lists, tuples and np.arrays

@mroeschke
Member

Best to address scalars vs array-likes in separate PRs. I imagine addressing array-likes may need to use one of the datetime inference functions for arrays

@hec10r
Author

hec10r commented Sep 1, 2021

I don't see a way of fixing array-likes without using the same logic as for scalars. Inferring the type of the array isn't enough given all the possible cases. Having a function that handles scalars and then mapping this function element by element for iterables seems to be the easiest solution.

@mroeschke
Member

mapping this function element by element for iterables seems to be the easiest solution

This will kill performance for the existing cases, so that implementation is probably a non-starter.
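
To get a feel for the cost being discussed, the following sketch times an element-wise conversion against a vectorized cast on a datetime64 Series. It is only a micro-benchmark under assumed conditions (10,000 daily timestamps, no missing values); the function names are made up for the comparison, and no timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

s = pd.Series(pd.date_range("2021-01-01", periods=10_000, freq="D"))


def elementwise(series):
    # Calls a Python function once per element.
    return series.map(lambda ts: ts.value)


def vectorized(series):
    # Normalizes to nanoseconds, then casts the whole array at once.
    return series.astype("datetime64[ns]").astype("int64")


# Both approaches produce the same integers on this input.
same = bool((elementwise(s) == vectorized(s)).all())
print(same)  # True

print("element-wise:", timeit.timeit(lambda: elementwise(s), number=3))
print("vectorized:  ", timeit.timeit(lambda: vectorized(s), number=3))
```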

@hec10r
Author

hec10r commented Sep 1, 2021

Understood. Then let's split the problem:

  1. PR to fix behavior for scalar date-like objects, including pd.NaT (will do this before the weekend)
  2. PR to fix behavior for iterables: infer types for list/tuples and use current approach for np.array when type is number-like. Approach for mixed types TBF (open to discuss and implement)

mroeschke added the Dtype Conversions and Datetime labels and removed the Needs Triage label Sep 30, 2021
@jack5github

This is still an ongoing issue as of the 13th of February, 2025. This could theoretically be fixed by using the dtype Int64, which has support for NA values.

from datetime import datetime
import pandas as pd

print(pd.to_numeric(pd.Series([datetime(2025, 2, 13), pd.NaT])))

"""
0    1739404800000000000
1   -9223372036854775808
dtype: int64
"""
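
A minimal sketch of the Int64 idea, written as a user-side workaround rather than a change to pd.to_numeric. The helper name datetimes_to_nullable_int is hypothetical; it relies on pd.arrays.IntegerArray, which pairs an integer array with a boolean mask of missing positions.

```python
from datetime import datetime

import pandas as pd


def datetimes_to_nullable_int(series):
    """Hypothetical helper: datetime64 Series -> Int64 nanoseconds with <NA>."""
    # Normalize to nanoseconds so the integers match the ones in this thread.
    ints = series.astype("datetime64[ns]").astype("int64").to_numpy()
    # Positions that held NaT are flagged as missing instead of
    # keeping the int64 sentinel value.
    mask = series.isna().to_numpy()
    values = pd.arrays.IntegerArray(ints, mask, copy=True)
    return pd.Series(values, index=series.index)


result = datetimes_to_nullable_int(pd.Series([datetime(2025, 2, 13), pd.NaT]))
print(result)
```

With this approach the first element stays 1739404800000000000 and the second becomes <NA>, so the result keeps an integer dtype without exposing the NaT sentinel.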

6 participants