pd.to_numeric - float64, object or error? #17007

Closed
mficek opened this issue Jul 18, 2017 · 13 comments
Labels: API Design, Bug, Dtype Conversions (Unexpected or buggy dtype conversions)
Milestone: 0.21.0

@mficek

mficek commented Jul 18, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

print(pd.__version__)

# different-result example
d = pd.DataFrame({'a': [200, 300, '', 'NaN', 10000000000000000000]})

# returns dtype object
a = pd.to_numeric(d['a'], errors='coerce')
print(a.dtype)

# returns dtype float64
b = d['a'].apply(pd.to_numeric, errors='coerce')
print(b.dtype)

# why not float64?
d = pd.DataFrame({'a': [200, 300, '', 'NaN', 30000000000000000000]})

# returns dtype object
a = pd.to_numeric(d['a'], errors='coerce')
print(a.dtype)

# raises OverflowError
b = d['a'].apply(pd.to_numeric, errors='coerce')
print(b.dtype)

Problem description

Hi guys, I realized that the result of to_numeric changes depending on how you pass a Series to the function; please see the example above. When I call to_numeric with the Series passed as the argument, it returns dtype object, but when I apply to_numeric to that Series element-wise, it returns float64. Moreover, I'm a bit confused about the correct behavior of to_numeric: why doesn't it convert a very long int-like number to float64? It throws an exception from which I can't even deduce which number (position, index) caused it.

I'm pretty sure my issue is already being discussed somewhere; I tried to search for the proper issue but only found bits and pieces about to_numeric and conversions in general. Please feel free to move my issue to a more appropriate thread.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Contributor

This case is clearly a bug:

In [18]: pd.to_numeric([200, 300, '', 'NaN', 10000000000000000000], errors='coerce')
Out[18]: array([200, 300, '', 'NaN', 10000000000000000000], dtype=object)

But it's not entirely clear how we should handle "big" integers. numpy does not cast to float on default construction, but we could here.

In [22]: np.array([10000000000000000000])
Out[22]: array([10000000000000000000], dtype=uint64)

In [23]: np.array([30000000000000000000])
Out[23]: array([30000000000000000000], dtype=object)

@chris-b1 chris-b1 added API Design Bug Dtype Conversions Unexpected or buggy dtype conversions labels Jul 18, 2017
@gfyoung
Member

gfyoung commented Jul 18, 2017

@chris-b1 : Your first example I can explain. You can't hold both NaN and uint64 values in the same numeric dtype without losing precision, so we refuse to cast in that case. That's why you get the output that you get, so no, that is not a bug IMO.

As for your point about "big" integers, do keep in mind that you can't hold integers above np.iinfo(np.uint64).max. That's just how numpy dtypes are currently designed. Thus, the second example is moot at this point.
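
To make the precision argument concrete, here is a small numpy-only illustration (my own sketch, not from the thread): a uint64-range value does not survive the float64 representation that would also be needed to hold NaN.

import numpy as np

big = 10000000000000000001          # fits in uint64, but is not exactly representable as float64

print(np.uint64(big))               # 10000000000000000001 -- exact as uint64
print(np.float64(big))              # 1e+19 -- the low digits are rounded away
print(int(np.float64(big)) == big)  # False: the cast is lossy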

@gfyoung
Member

gfyoung commented Jul 18, 2017

@mficek : My explanation about not being able to hold NaN and uint64 together in the same numeric dtype applies to both of your examples where you call pd.to_numeric on the entire Series.

For your first .apply example, try printing the output of each to_numeric call along with its data type. What you will find is that the numeric types do not match (you have uint64, int64, and float). In this case, we just collate the outputs into an array-like (we have no idea that the function being applied is to_numeric, so precision is moot), which forces the cast to float64.

Of the four examples, only your last one would I consider a bug. When we coerce, we should not crash, because any "bad" elements are supposed to be replaced with NaN. That means we need to catch the OverflowError somewhere in the implementation.
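
A minimal sketch (my own, not from the thread) of the two call styles, as they behave on the reported pandas 0.20.x:

import pandas as pd

s = pd.Series([200, 300, '', 'NaN', 10000000000000000000])

# .apply calls to_numeric once per scalar; the per-element results have mixed
# numeric types, and collating them into one array forces a cast to float64.
print(s.apply(pd.to_numeric, errors='coerce').dtype)   # float64

# Calling to_numeric on the whole Series sees the NaNs and the uint64-range
# value together and refuses the lossy cast, leaving dtype object on 0.20.x.
print(pd.to_numeric(s, errors='coerce').dtype)          # object on 0.20.x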

@chris-b1
Contributor

  1. The bug is that we are returning the object unchanged with errors='coerce' - anything unconvertible should be NaN

  2. @mficek was suggesting that we upcast to float in that scenario (which happens to be lossless with this particular value)

In [27]: np.array([30000000000000000000], dtype='float64')
Out[27]: array([  3.00000000e+19])

@gfyoung
Member

gfyoung commented Jul 18, 2017

The bug is that we are returning the object unchanged with errors='coerce' - anything unconvertible should be NaN

Ah, I see...I thought it was erroring on an element earlier. Yes, that empty string should just become NaN.

@mficek was suggesting that we upcast to float in that scenario (which happens to be lossless with this particular value)

We made a deliberate decision not to do that. uint64 --> float64 casting is not something we want to do, because more often than not we would be losing precision. I suppose you could check each number to see whether it would lose precision, but I imagine that would be a performance hit.
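
For what it's worth, a hypothetical per-value check of that kind could be a simple float64 round trip, as sketched below (illustrative only, not pandas code); it would still cost an extra pass over the data.

import numpy as np

arr = np.array([150, 2**53, 2**53 + 1, 18000000000000000000], dtype='uint64')

# Round-trip each value through float64; values above 2**53 may not survive intact.
lossless = arr.astype('float64').astype('uint64') == arr
print(lossless)   # the 2**53 + 1 entry comes back False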

@mficek
Author

mficek commented Jul 18, 2017

Thanks for your explanation! I figured out a workaround that works for me with pandas==0.20.3: basically, I convert to string and append '.' to every "number", which always leads to a float64 conversion; then I check for elements bigger than np.iinfo(np.uint64).max and replace them with np.nan. Finally, I drop the NaNs and convert everything to uint64 (which is the type I need in my code).

    typ = np.uint64   # the target dtype
    # appending '.' forces string parsing, so everything converts to float64
    d[col] = (d[col].astype(str) + '.').apply(pd.to_numeric, errors='coerce')
    # values too large for uint64 become NaN
    d.loc[d[col] > np.iinfo(typ).max, col] = np.nan
    # drop the NaN rows before the final cast to uint64
    d = d.dropna(subset=[col])
    d[col] = d[col].astype(typ)

Nothing nice, but it works. Can I be of any help in resolving this issue? Despite being a big fan of pandas, I've never contributed to its code...

@gfyoung
Member

gfyoung commented Jul 18, 2017

@chris-b1 : I actually can explain why the first and third examples behave as they do. When you try to convert the "NaN" string to nan, that conflicts with the uint64-range value at the end. Thus, we refuse to coerce (or mutate) any values because of that potential loss of precision.

To be fair, this is not well documented, and we should at least document it (unless people really think this behavior should be changed).

@gfyoung
Member

gfyoung commented Jul 18, 2017

Can I be anyhow helpful in resolving this issue?

@mficek : Absolutely! The following code locations will help you out here:

  1. to_numeric can be found here

  2. maybe_convert_numeric can be found here (you will encounter this function in the implementation of to_numeric)

Try to figure out why the OverflowError is being raised in your last example. It seems to be triggered from maybe_convert_objects (you can find that here), but walk through the code and see if you can see why.

Finally, here is the link to the documentation for contributing to pandas - it will also give you instructions on how to set up a development environment so that you don't overwrite your existing pandas installation.
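
As a starting point, here is a toy sketch (my own illustration, not the actual pandas code path) of what per-element coercion should do with a value that overflows every numpy integer dtype: swallow the error and emit NaN or a float64 fallback rather than raise.

import numpy as np

def coerce_scalar(value):
    # Toy stand-in for per-element coercion under errors='coerce'.
    try:
        as_int = int(value)
        if as_int > np.iinfo(np.uint64).max or as_int < np.iinfo(np.int64).min:
            return np.float64(as_int)   # out of integer range: fall back to (lossy) float64
        return as_int
    except (ValueError, TypeError, OverflowError):
        return np.nan                   # unparseable: coerce to NaN instead of raising

print([coerce_scalar(v) for v in [200, '', 'NaN', 30000000000000000000]])
# -> 200, nan, nan, 3e+19 (exact repr varies by numpy version)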

@chris-b1
Contributor

chris-b1 commented Jul 18, 2017

I think that errors='coerce' should force everything to float64 if anything is missing. It's not fundamentally different from our normal treatment of ints in the presence of missing data, and it's an opt-in, deliberately lossy kwarg.

In [6]: pd.to_numeric([1, 2, ''], errors='coerce')
Out[6]: array([  1.,   2.,  nan])

That said, I do see your point that in the range between int64 max and uint64 max it becomes "lossier".

@gfyoung
Member

gfyoung commented Jul 18, 2017

It's not fundamentally different from our normal treatment of ints in the presence of missing data, and it's an opt-in, deliberately lossy kwarg.

@chris-b1 : I would disagree with you on that. uint is a different beast from int, which is why we have had problems handling them for some time. For starters, it has no nan like int does, which makes it trickier to handle in conjunction with missing data.

Also, having worked on to_numeric before: previous discussion on this topic has centered on making conversion operations like these as precision-preserving as possible. Thus, I would push back against making coercion "lossy" for uint64, given the issues I've had to patch to prevent exactly that from happening.

@chris-b1
Contributor

For starters, it has no nan like int does,

int cannot represent nan either

In [1]: np.array([1, 2, np.nan]).dtype
Out[1]: dtype('float64')

@gfyoung
Member

gfyoung commented Jul 18, 2017

int cannot represent nan either

Sorry, what I meant is that we don't have a placeholder for missing uint like we do for int (I don't count nan, because of the precision loss).

@jreback
Contributor

jreback commented Jul 19, 2017

I agree with @chris-b1 here.

It doesn't matter what is there when we have errors='coerce'. If it's not convertible to the lowest common dtype, then it gets a NaN. Now, during the conversion you may be inferring and see a uint64, then a NaN, so we fall back to object; but I agree that is buggy - it should return float64 if it has to coerce.
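
For reference, a minimal sketch (my own, not from the thread) of the behavior being requested here: with errors='coerce', the whole Series should come back as float64, with the unconvertible entries replaced by NaN.

import pandas as pd

s = pd.Series([200, 300, '', 'NaN', 30000000000000000000])

# Requested behavior: coerce everything to float64; '' and 'NaN' become NaN and
# the out-of-range integer becomes 3e+19 (lossy, but opted into via errors='coerce').
out = pd.to_numeric(s, errors='coerce')
print(out.dtype)   # float64 on releases that include the fix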

@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Jul 31, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
@jreback jreback modified the milestones: Next Major Release, 0.21.0 Oct 9, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017