Series.all much slower than Series.values.all #26032

Closed
ericstarr opened this issue Apr 9, 2019 · 5 comments · Fixed by #52381
Labels
Performance Memory or execution speed performance

Comments

@ericstarr

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# Series of bools

s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)

# ~1.45 ms
%timeit s.any(skipna=True)

# ~1.35 ms
%timeit s.any(skipna=False)

# ~6.5 us - Note that I get a message about possible caching, but
# even after multiplying by worst case multiplier, still an order of
# magnitude faster than s.any()
%timeit s.values.any()


# Series of ints

s2 = pd.Series(np.random.randint(0, 2, 100000))

# ~330 us
%timeit s2.any(skipna=True)

# ~280 us
%timeit s2.any(skipna=False)

# ~90 us - No possible caching warning on this one
%timeit s2.values.any()

Problem description

Calling Series.any is much slower than calling Series.values.any on a series of bools. Interestingly, calling Series.any on a series of ints is quite a bit faster than on a series of bools, though even for a series of ints, Series.values.any is still faster.

I ran with both skipna=True and skipna=False in case it was an issue of how NaNs are being handled.

I see the same time differences with Series.all.
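
For anyone reproducing this outside IPython, the comparison above can be sketched with the standard-library `timeit` module. This is a minimal sketch: the helper names `time_call` and `benchmark_any` are made up for illustration, and absolute timings will differ by machine and by pandas/NumPy version.

```python
import timeit

import numpy as np
import pandas as pd


def time_call(func, repeat=5, number=100):
    """Return the best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def benchmark_any(n=100_000, seed=0):
    """Time Series.any against Series.values.any for bool and int data."""
    rng = np.random.RandomState(seed)
    s_bool = pd.Series(rng.randint(0, 2, n)).astype(bool)
    s_int = pd.Series(rng.randint(0, 2, n))
    return {
        "bool Series.any": time_call(lambda: s_bool.any()),
        "bool values.any": time_call(lambda: s_bool.values.any()),
        "int Series.any": time_call(lambda: s_int.any()),
        "int values.any": time_call(lambda: s_int.values.any()),
    }


if __name__ == "__main__":
    for label, seconds in benchmark_any().items():
        print(f"{label:>16}: {seconds * 1e6:8.2f} us")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the calls being compared take only microseconds.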

Expected Output

I would expect the performance to be comparable. Maybe not exactly the same, but not order(s) of magnitude slower.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.3
pytest: 3.3.0
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.1
scipy: 0.18.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 1.5.1
openpyxl: 2.5.6
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.9999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: 0.0.8
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Contributor

chris-b1 commented Apr 9, 2019

You're welcome to have a look here. Ultimately it will be due to the overhead of unpacking the array to deal with missing values, but it might be something that can be elided.

def _get_values(values, skipna, fill_value=None, fill_value_typ=None,
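
To illustrate the kind of overhead described here, the following is a rough sketch of a generic reduction that always checks for missing values before reducing. It is not the pandas `_get_values` implementation; `any_with_mask` is a hypothetical name, and the sketch only assumes that a generic path has to build a missing-value mask regardless of dtype.

```python
import numpy as np
import pandas as pd


def any_with_mask(values, skipna=True):
    """Hypothetical generic path: look for missing values, then reduce."""
    values = np.asarray(values)
    if skipna:
        mask = pd.isna(values)           # one full pass, even for bool data
        if mask.any():                   # another pass over the mask
            values = values.copy()       # avoid mutating the caller's array
            np.putmask(values, mask, 0)  # missing entries count as False
    return bool(values.any())
```

For a bool array the mask is always all-False, so those extra passes are pure overhead compared with calling `values.any()` directly.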

@chris-b1 chris-b1 added the Performance Memory or execution speed performance label Apr 9, 2019
@chris-b1 chris-b1 added this to the Contributions Welcome milestone Apr 9, 2019
@ericstarr
Author

I spent a little time looking into this and don't see an obvious way to make it faster. Potentially in the case where the dtype of the Series is bool and skipna is True you could bypass the _get_values call and go straight to df.values.any, but that seems like a messy way of dealing with this.
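
A user-level version of that bypass could look like the sketch below. It is a workaround under stated assumptions, not a proposed pandas change: `fast_any` is a hypothetical helper, and the shortcut relies on the fact that a NumPy bool array cannot hold NaN, so `skipna` has no effect on that path.

```python
import numpy as np
import pandas as pd


def fast_any(series, skipna=True):
    """Hypothetical helper: skip NA handling when the data is plain NumPy bool."""
    values = series.values
    if isinstance(values, np.ndarray) and values.dtype == np.bool_:
        # NumPy bool arrays cannot contain NaN, so reduce directly.
        return bool(values.any())
    # Everything else (ints, floats, nullable dtypes) uses the normal path.
    return bool(series.any(skipna=skipna))
```

The `isinstance` check keeps extension-array-backed Series, such as the nullable `"boolean"` dtype, on the regular `Series.any` path.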

@jreback
Contributor

jreback commented Apr 10, 2019

this is already being addressed in #25070

cc @qwhelan

@qwhelan
Contributor

qwhelan commented Apr 10, 2019

I fixed the root cause in numpy with numpy/numpy#12988 so 1.17 should be about 250x faster in this regard.

@arw2019
Member

arw2019 commented Sep 24, 2020

Below are the results on 1.2 master. Booleans now outperform ints by ~4x for Series.any and ~25x for Series.values.any.

In [7]: s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool) 

In [8]: %timeit s.any(skipna=True)                                                                     
16.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [9]: %timeit s.any(skipna=False) 
16.3 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit s.values.any() 
2.19 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: s2 = pd.Series(np.random.randint(0, 2, 100000)) 

In [12]: %timeit s2.any(skipna=True) 
71.6 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit s2.any(skipna=False) 
68.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: %timeit s2.values.any() 
53.4 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
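
The ratios quoted in this comment can be checked from the mean timings above (using the `skipna=True` runs). The snippet below only does that arithmetic; the dictionary and variable names are illustrative.

```python
# Mean timings in microseconds, copied from the %timeit output above.
timings_us = {
    ("bool", "Series.any"): 16.8,
    ("bool", "values.any"): 2.19,
    ("int", "Series.any"): 71.6,
    ("int", "values.any"): 53.4,
}

# How much faster bools are than ints for each call.
bool_vs_int_series = timings_us[("int", "Series.any")] / timings_us[("bool", "Series.any")]
bool_vs_int_values = timings_us[("int", "values.any")] / timings_us[("bool", "values.any")]

# Remaining overhead of Series.any relative to the raw NumPy call.
series_overhead_bool = timings_us[("bool", "Series.any")] / timings_us[("bool", "values.any")]
series_overhead_int = timings_us[("int", "Series.any")] / timings_us[("int", "values.any")]

print(f"bool vs int, Series.any: {bool_vs_int_series:.1f}x")      # 4.3x
print(f"bool vs int, values.any: {bool_vs_int_values:.1f}x")      # 24.4x
print(f"Series.any overhead, bool: {series_overhead_bool:.1f}x")  # 7.7x
print(f"Series.any overhead, int: {series_overhead_int:.1f}x")    # 1.3x
```

So on these numbers Series.any on bools is still roughly 7.7x slower than the raw NumPy call, compared with the gap of more than 200x in the original report (~1.45 ms vs ~6.5 us).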

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
This was referenced Apr 1, 2023