pd.to_numeric - float64, object or error? #17007

Closed
mficek opened this issue Jul 18, 2017 · 13 comments
Labels: API Design, Bug, Dtype Conversions (Unexpected or buggy dtype conversions)
Milestone: 0.21.0

@mficek

mficek commented Jul 18, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

print(pd.__version__)

# different-result example
d = pd.DataFrame({'a': [200, 300, '', 'NaN', 10000000000000000000]})

# returns dtype object
a = pd.to_numeric(d['a'], errors='coerce')
print(a.dtype)

# returns dtype float64
b = d['a'].apply(pd.to_numeric, errors='coerce')
print(b.dtype)

# why not float64?
d = pd.DataFrame({'a': [200, 300, '', 'NaN', 30000000000000000000]})

# returns dtype object
a = pd.to_numeric(d['a'], errors='coerce')
print(a.dtype)

# raises OverflowError
b = d['a'].apply(pd.to_numeric, errors='coerce')
print(b.dtype)

Problem description

Hi guys, I realized that the result of to_numeric changes depending on how you pass a Series to the function; please see the example above. When I call to_numeric with the Series passed as the argument, it returns dtype object, but when I apply to_numeric to that Series element-wise, it returns float64. Moreover, I'm a bit confused about the correct behavior of to_numeric: why doesn't it convert a very long int-like number to float64? It throws an exception from which I can't even deduce which number (position, index) caused it.

I'm pretty sure my issue is already being discussed somewhere; I tried to search for the proper issue but only found bits and pieces about to_numeric and conversions in general. Please feel free to move my issue to a more appropriate thread.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Contributor

This case is clearly a bug:

In [18]: pd.to_numeric([200, 300, '', 'NaN', 10000000000000000000], errors='coerce')
Out[18]: array([200, 300, '', 'NaN', 10000000000000000000], dtype=object)

But it's not entirely clear how we should handle "big" integers. numpy does not cast to float on default construction, but we could here.

In [22]: np.array([10000000000000000000])
Out[22]: array([10000000000000000000], dtype=uint64)

In [23]: np.array([30000000000000000000])
Out[23]: array([30000000000000000000], dtype=object)

@chris-b1 chris-b1 added API Design Bug Dtype Conversions Unexpected or buggy dtype conversions labels Jul 18, 2017
@gfyoung
Member

gfyoung commented Jul 18, 2017

@chris-b1 : Your first example I can explain. You can't hold both NaN and uint64 values in the same numeric dtype without losing precision, so we refuse to cast in that case. That's why you get the output that you get, so no, that is not a bug IMO.

As for your point about "big" integers, do keep in mind that you can't hold integers above np.iinfo(np.uint64).max. That's just how numpy dtypes are currently designed. Thus, the second example is moot at this point.
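
To make the precision argument concrete, here is a small numpy-only illustration (my own sketch, not from the thread): a uint64-range value does not survive the float64 representation that would also be needed to hold NaN.

import numpy as np

big = 10000000000000000001          # fits in uint64, but is not exactly representable as float64

print(np.uint64(big))               # 10000000000000000001 -- exact as uint64
print(np.float64(big))              # 1e+19 -- the low digits are rounded away
print(int(np.float64(big)) == big)  # False: the cast is lossy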

@gfyoung
Member

gfyoung commented Jul 18, 2017

@mficek : My explanation about not being able to hold NaN and uint64 together in the same numeric dtype applies to both of your examples where you call pd.to_numeric on the entire Series.

For your first .apply example, try printing the output of each to_numeric call along with its data type. What you will find is that the numeric types do not match (you have uint64, int64, and float). In this case, we just collate the outputs into an array-like (we have no idea that the function being applied is to_numeric, so precision is moot), which forces the cast to float64.

Of the four examples, only your last one would I consider a bug. When we coerce, we should not crash, because any "bad" elements are supposed to be replaced with NaN. That means we need to catch the OverflowError somewhere in the implementation.
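
A minimal sketch (my own, not from the thread) of the two call styles, as they behave on the reported pandas 0.20.x:

import pandas as pd

s = pd.Series([200, 300, '', 'NaN', 10000000000000000000])

# .apply calls to_numeric once per scalar; the per-element results have mixed
# numeric types, and collating them into one array forces a cast to float64.
print(s.apply(pd.to_numeric, errors='coerce').dtype)   # float64

# Calling to_numeric on the whole Series sees the NaNs and the uint64-range
# value together and refuses the lossy cast, leaving dtype object on 0.20.x.
print(pd.to_numeric(s, errors='coerce').dtype)          # object on 0.20.x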

@chris-b1
Contributor

  1. The bug is that we are returning the object unchanged with errors='coerce' - anything unconvertible should be NaN

  2. @mficek was suggesting that we upcast to float in that scenario (which happens to be lossless with this particular value)

In [27]: np.array([30000000000000000000], dtype='float64')
Out[27]: array([  3.00000000e+19])

@gfyoung
Member

gfyoung commented Jul 18, 2017

The bug is that we are returning the object unchanged with errors='coerce' - anything unconvertible should be NaN

Ah, I see...I thought it was erroring on an element earlier. Yes, that empty string should just become NaN.

@mficek was suggesting that we upcast to float in that scenario (which happens to be lossless with this particular value)

We made a deliberate decision not to do that. uint64 --> float64 casting is not something we want to do, because more often than not we would be losing precision. I suppose you could check each number to see whether it would lose precision, but I imagine that would be a performance hit.
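
For what it's worth, a hypothetical per-value check of that kind could be a simple float64 round trip, as sketched below (illustrative only, not pandas code); it would still cost an extra pass over the data.

import numpy as np

arr = np.array([150, 2**53, 2**53 + 1, 18000000000000000000], dtype='uint64')

# Round-trip each value through float64; values above 2**53 may not survive intact.
lossless = arr.astype('float64').astype('uint64') == arr
print(lossless)   # the 2**53 + 1 entry comes back False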

@mficek
Author

mficek commented Jul 18, 2017

Thanks for your explanation! I figured out a workaround that works for me with pandas==0.20.3: basically, I convert to string and append '.' to every "number", which always leads to a float64 conversion; then I check for elements bigger than np.iinfo(np.uint64).max and replace them with np.nan. Finally, I drop the NaNs and convert everything to uint64 (which is the type I need in my code).

    typ = np.uint64   # the target dtype
    # appending '.' forces string parsing, so everything converts to float64
    d[col] = (d[col].astype(str) + '.').apply(pd.to_numeric, errors='coerce')
    # values too large for uint64 become NaN
    d.loc[d[col] > np.iinfo(typ).max, col] = np.nan
    # drop the NaN rows before the final cast to uint64
    d = d.dropna(subset=[col])
    d[col] = d[col].astype(typ)

Nothing nice, but it works. Can I be of any help in resolving this issue? Despite being a big fan of pandas, I've never contributed to its code...

@gfyoung
Member

gfyoung commented Jul 18, 2017

@chris-b1 : I actually can explain why the first and third examples behave as they do. When you try to convert the "NaN" string to nan, that conflicts with the uint64-range value at the end. Thus, we refuse to coerce (or mutate) any values because of that potential loss of precision.

To be fair, this is not well documented, and we should at least document it (unless people really think this behavior should be changed).

@gfyoung
Member

gfyoung commented Jul 18, 2017

Can I be anyhow helpful in resolving this issue?

@mficek : Absolutely! The following code locations will help you out here:

  1. to_numeric can be found here

  2. maybe_convert_numeric can be found here (you will encounter this function in the implementation of to_numeric)

Try to figure out why the OverflowError is being raised in your last example. It seems to be triggered from maybe_convert_objects (you can find that here), but walk through the code and see if you can see why.

Finally, here is the link to the documentation for contributing to pandas - it will also give you instructions on how to set up a development environment so that you don't overwrite your existing pandas installation.
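
As a starting point, here is a toy sketch (my own illustration, not the actual pandas code path) of what per-element coercion should do with a value that overflows every numpy integer dtype: swallow the error and emit NaN or a float64 fallback rather than raise.

import numpy as np

def coerce_scalar(value):
    # Toy stand-in for per-element coercion under errors='coerce'.
    try:
        as_int = int(value)
        if as_int > np.iinfo(np.uint64).max or as_int < np.iinfo(np.int64).min:
            return np.float64(as_int)   # out of integer range: fall back to (lossy) float64
        return as_int
    except (ValueError, TypeError, OverflowError):
        return np.nan                   # unparseable: coerce to NaN instead of raising

print([coerce_scalar(v) for v in [200, '', 'NaN', 30000000000000000000]])
# -> 200, nan, nan, 3e+19 (exact repr varies by numpy version)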

@chris-b1
Contributor

chris-b1 commented Jul 18, 2017

I think that errors='coerce' should force everything to float64 if anything is missing. It's not fundamentally different from our normal treatment of ints in the presence of missing data, and it's an opt-in, deliberately lossy kwarg.

In [6]: pd.to_numeric([1, 2, ''], errors='coerce')
Out[6]: array([  1.,   2.,  nan])

That said, I do see your point that in the range between int64 max and uint64 max it becomes "lossier".

@gfyoung
Member

gfyoung commented Jul 18, 2017

It's not fundamentally different from our normal treatment of ints in the presence of missing data, and it's an opt-in, deliberately lossy kwarg.

@chris-b1 : I would disagree with you on that. uint is a different beast from int, which is why we have had problems handling them for some time. For starters, it has no nan like int does, which makes it trickier to handle in conjunction with missing data.

Also, having worked on to_numeric before: previous discussion on this topic has centered on making conversion operations like these as precision-preserving as possible. Thus, I would push back against making coercion "lossy" for uint64, given the issues I've had to patch to prevent exactly that from happening.

@chris-b1
Contributor

For starters, it has no nan like int does,

int cannot represent nan either

In [1]: np.array([1, 2, np.nan]).dtype
Out[1]: dtype('float64')

@gfyoung
Member

gfyoung commented Jul 18, 2017

int cannot represent nan either

Sorry, what I meant is that we don't have a placeholder for missing uint like we do for int (I don't count nan, because of the precision loss).

@jreback
Contributor

jreback commented Jul 19, 2017

I agree with @chris-b1 here.

It doesn't matter what is there when we have errors='coerce'. If it's not convertible to the lowest common dtype, then it gets a NaN. Now, during the conversion you may be inferring and see a uint64, then a NaN, so we fall back to object; but I agree that is buggy - it should return float64 if it has to coerce.
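
For reference, a minimal sketch (my own, not from the thread) of the behavior being requested here: with errors='coerce', the whole Series should come back as float64, with the unconvertible entries replaced by NaN.

import pandas as pd

s = pd.Series([200, 300, '', 'NaN', 30000000000000000000])

# Requested behavior: coerce everything to float64; '' and 'NaN' become NaN and
# the out-of-range integer becomes 3e+19 (lossy, but opted into via errors='coerce').
out = pd.to_numeric(s, errors='coerce')
print(out.dtype)   # float64 on releases that include the fix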

@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Jul 31, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
@jreback jreback modified the milestones: Next Major Release, 0.21.0 Oct 9, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017