pd.Series.interpolate(method="quadratic") Error with non-numeric index column #21662

Closed
kevinislas2 opened this issue Jun 28, 2018 · 3 comments · Fixed by #25394
Labels: Docs; Error Reporting (incorrect or improved errors from pandas); good first issue; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

@kevinislas2

I'm working with the Series.interpolate function and I noticed that a DataFrame's index column can cause some weird problems when using the quadratic method.

First example: interpolating data with a non-numeric index raises an error:

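import numpy as np
import pandas as pd

data = {'A': ["a", "b", "c", "d"], 'B': [0, 1, np.nan, 100]}
df = pd.DataFrame(data=data)
df.set_index("A", inplace=True)  # the index is now non-numeric (object dtype)
s = pd.Series(df["B"])

s.interpolate(method="quadratic")  # raises the TypeError quoted below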

Raises the following error:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I know pandas uses SciPy's quadratic interpolation, and while this error is raised inside SciPy, I believe the cause is that pandas expects a numeric index when the quadratic method is used and passes the index values on to SciPy.

The same code runs without errors when the quadratic method is not used.

The following code, which uses a numeric index, also runs without any errors:

data = {'A': [0, 1, 2, 4], 'B': [0, 1, np.nan, 100]}  # numeric, unevenly spaced index
df = pd.DataFrame(data=data)
df.set_index("A", inplace=True)
s = pd.Series(df["B"])

s.interpolate(method="quadratic")

outputs:

A
0      0.0
1      1.0
2     18.0
4    100.0
Name: B, dtype: float64

So while it makes sense to use the index as an indicator of how many time steps separate two values in a series, other methods seem to simply assume a spacing of 1 between rows and ignore the index.
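As a sanity check on the output above, the interpolated value can be reproduced by hand from the index values shown there (0, 1, 2, 4). This is a minimal sketch, assuming that with only three known points the quadratic method reduces to the unique parabola through them. It uses numpy.polyfit rather than the SciPy routine pandas calls, so it illustrates the role of the index values, not pandas' internals.

```python
import numpy as np

# Known points from the example: index values (x) and the non-missing data (y).
x_known = np.array([0.0, 1.0, 4.0])
y_known = np.array([0.0, 1.0, 100.0])

# Assumption: with three points, quadratic interpolation is the unique parabola.
coeffs = np.polyfit(x_known, y_known, deg=2)

# Evaluate at the index value of the missing row (A == 2).
at_index_2 = np.polyval(coeffs, 2.0)

# For comparison: the same data placed on equally spaced positions 0, 1, 2, 3.
coeffs_equal = np.polyfit([0.0, 1.0, 3.0], y_known, deg=2)
at_position_2 = np.polyval(coeffs_equal, 2.0)

print(round(at_index_2, 6))     # 18.0, matching the output above
print(round(at_position_2, 6))  # 34.333333
```

The two results differ only because of the spacing implied by the index, which is why the index has to be numeric for this method.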

Two things I can think of that could be helpful are a more descriptive error message when the method receives a non-numeric index, or documenting this condition, as I couldn't find anything about this error and solved it by tinkering with the index column.

Let me know if either of these ideas seems appropriate for a PR.
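The index tinkering mentioned above could look roughly like the following. This is only a sketch: interpolate_by_position is a hypothetical helper, not a pandas function, and it assumes the rows are ordered and equally spaced, which may not hold for arbitrary labels. It also needs SciPy installed for the quadratic method.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, np.nan, 100], index=["a", "b", "c", "d"], name="B")


def interpolate_by_position(series, method="quadratic"):
    """Hypothetical helper: interpolate as if rows were one step apart."""
    # Assumption: row order is meaningful and spacing between rows is equal.
    positional = series.reset_index(drop=True)  # temporary 0..N-1 index
    filled = positional.interpolate(method=method)
    filled.index = series.index  # restore the original labels
    return filled


filled = interpolate_by_position(s)
print(filled)
```

Whether equal spacing is the right assumption depends on the data, so this is a workaround for a specific case and not a general default.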

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

I'm not sure we should assume that the user wants us to use range(N) as the interpolation values for object indexes. That's assuming things about ordering and equal-spacing that may not be correct.

> other methods seem to simply assume 1 between each row and avoid using the index column.

What methods do you mean by 'other'? AFAIK, linear is the only one that ignores the index values.
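To illustrate the distinction, here is a small sketch comparing method="linear" with method="index" on an unevenly spaced numeric index. The series is made up for illustration; neither method needs SciPy.

```python
import numpy as np
import pandas as pd

# Made-up data: the missing value sits at index 1, close to the left neighbour.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

linear = s.interpolate(method="linear")   # treats rows as equally spaced
by_index = s.interpolate(method="index")  # uses the numeric index values

print(linear.loc[1])    # 5.0
print(by_index.loc[1])  # 1.0
```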

> sending a more descriptive error message when the method receives a non-numeric index

That seems reasonable.
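A more descriptive error could be produced by a check along these lines. This is a sketch only: check_index_for_interpolation and the message wording are hypothetical, not the check pandas ships, and it assumes numeric and datetime indexes are the acceptable cases.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def check_index_for_interpolation(series, method):
    """Hypothetical pre-check; not part of the pandas API."""
    if method == "linear":
        return  # linear ignores the index values
    dtype = series.index.dtype
    # Assumption: numeric and datetime indexes are the supported cases.
    if not (is_numeric_dtype(dtype) or is_datetime64_any_dtype(dtype)):
        raise ValueError(
            f"method={method!r} uses the index values, so the index must be "
            f"numeric or datetime; got dtype {dtype}."
        )


s_bad = pd.Series([0.0, 1.0, None, 100.0], index=["a", "b", "c", "d"])
s_ok = pd.Series([0.0, 1.0, None, 100.0], index=[0, 1, 2, 4])

check_index_for_interpolation(s_ok, "quadratic")  # passes silently
try:
    check_index_for_interpolation(s_bad, "quadratic")
except ValueError as exc:
    print(exc)
```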

TomAugspurger added the Docs, Missing-data, Error Reporting, Effort Low, and good first issue labels on Jun 28, 2018
TomAugspurger added this to the Next Major Release milestone on Jun 28, 2018
@kevinislas2
Author

> What methods do you mean by 'other'? AFAIK, linear is the only one that ignores the index

You're right, Series.interpolate's documentation states that all methods except linear use index values:

> ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘polynomial’ is passed to scipy.interpolate.interp1d. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=4). These use the actual numerical values of the index.

I misread and thought that the last line referred only to polynomial and spline, though now that I think about it, it wouldn't make sense for the other methods not to use the index values.

jreback changed the milestone from Next Major Release to 0.24.0 on Jul 6, 2018
jreback changed the milestone from 0.24.0 to Contributions Welcome on Oct 11, 2018
@TrigonaMinima

I'll start working on this.
