
use_inf_as_null makes fillna extremely slow #12257

Closed
Winand opened this issue Feb 8, 2016 · 4 comments
Labels
Missing-data, Performance

Comments

@Winand
Contributor

Winand commented Feb 8, 2016

Setting mode.use_inf_as_null=True makes fillna extremely slow (about 14 s).

Most of the time is spent in self._data.fillna (core/generic.py:2833).

import pandas as pd
pd.set_option("mode.use_inf_as_null", True)  # treat +/-inf as missing
s = pd.read_msgpack(r"D:\slow_fillna.msgpack", encoding='utf-8')
s2 = s.fillna("<Н/Д>")  # "<Н/Д>" is a Russian "N/A" placeholder string

Attachment: slow_fillna.msgpack.gz
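For readers without the attached file, a minimal self-contained stand-in might look like the following (an assumption on my part: the slowdown should reproduce on any reasonably large mixed-dtype frame containing +/-inf once the option is enabled; the column names and sizes below are made up):

import time
import numpy as np
import pandas as pd

# Made-up mixed-dtype frame standing in for the attached msgpack data.
n = 1000000
df = pd.DataFrame({"num": np.random.randn(n), "txt": ["x"] * n})
df.loc[::10, "num"] = np.inf

pd.set_option("mode.use_inf_as_null", False)
t0 = time.perf_counter()
df.fillna("<Н/Д>")
print("option off:", time.perf_counter() - t0)

pd.set_option("mode.use_inf_as_null", True)  # +/-inf now treated as missing
t0 = time.perf_counter()
df.fillna("<Н/Д>")
print("option on: ", time.perf_counter() - t0)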

Installed versions:

python: 3.4.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 18.7.1
Cython: 0.23.4
numpy: 1.9.3
@jreback
Contributor

jreback commented Feb 8, 2016

Please show a self-contained, reproducible example.

@jreback added the Missing-data and Performance labels on Feb 8, 2016
@cemsbr
Contributor

cemsbr commented Sep 16, 2016

Maybe this is a more general problem: I noticed that it makes dropna() slow, too. It is much faster to call replace([np.inf, -np.inf], np.nan).dropna(). This happens every time with my data, but it is difficult to produce a self-contained reproducible example.
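A minimal sketch of that workaround, with a small made-up placeholder frame standing in for the real data:

import numpy as np
import pandas as pd

# Placeholder frame; stands in for the real mixed-dtype data.
df = pd.DataFrame({"a": [1.0, np.inf, 2.0, -np.inf], "b": ["w", "x", "y", "z"]})

# Convert infinities to NaN explicitly, then drop those rows, instead of
# relying on mode.use_inf_as_null during dropna().
clean = df.replace([np.inf, -np.inf], np.nan).dropna()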

@apiszcz

apiszcz commented Aug 22, 2019

I am seeing the same behavior: 30 million rows take 5 minutes to run the replace. The DataFrame has mixed dtypes and fewer than 30 columns. Something strange is happening.

Making a simple example with a single dtype and a few more columns does not demonstrate the issue.

import numpy as np
import pandas as pd

a1 = np.zeros((10000000, 1))
a1[:, :] = np.inf
df = pd.DataFrame(a1)
%time df.replace([np.inf, -np.inf], np.nan)
# ~13 ms

a1 = np.zeros((10000000, 1))
a1[:, :] = np.inf
df = pd.DataFrame(a1)
df['a1'] = ''   # adding an object column makes the frame mixed-dtype
df['n1'] = 0.0
%time df.replace([np.inf, -np.inf], np.nan)
# ~841 ms
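One possible mitigation (my assumption, not something suggested in the thread): if only the float columns can hold +/-inf, restricting the replace to those columns keeps it off the object blocks:

import numpy as np
import pandas as pd

a1 = np.zeros((10000000, 1))
a1[:, :] = np.inf
df = pd.DataFrame(a1)
df['a1'] = ''
df['n1'] = 0.0

# Replace infinities only in the float64 columns; the object column is left alone.
float_cols = df.select_dtypes(include=['float64']).columns
df[float_cols] = df[float_cols].replace([np.inf, -np.inf], np.nan)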

@jbrockmendel
Member

Closed by #53494.
