Series.all much slower than Series.values.all #26032

ericstarr · 2019-04-09T15:22:32Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# Series of bools

s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)

# ~1.45 ms
%timeit s.any(skipna=True)

# ~1.35 ms
%timeit s.any(skipna=False)

# ~6.5 us - Note that I get a message about possible caching, but
# even after multiplying by worst case multiplier, still an order of
# magnitude faster than s.any()
%timeit s.values.any()


# Series of ints

s2 = pd.Series(np.random.randint(0, 2, 100000))

# ~330 us
%timeit s2.any(skipna=True)

# ~280 us
%timeit s2.any(skipna=False)

# ~90 us - No possible caching warning on this one
%timeit s2.values.any()

Problem description

Calling Series.any is much slower than calling Series.values.any on a Series of bools.
Interestingly, calling Series.any on a Series of ints is quite a bit faster than on a Series of bools, though even for ints Series.values.any is still faster.

I ran with both skipna=True and skipna=False in case it was an issue of how NaNs are being handled.

I see the same time differences with Series.all
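For anyone reproducing this outside IPython, here is a minimal sketch of the same comparison using the standard-library `timeit` module, covering both `any` and `all`. The helper name `best_time_us` and the loop counts are illustrative choices, not part of the original report, and absolute numbers will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)


def best_time_us(func, number=50, repeat=3):
    """Return the best per-call time in microseconds (hypothetical helper)."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1e6


# Time the pandas reductions against the raw NumPy reductions.
timings = {
    "s.any()": best_time_us(lambda: s.any()),
    "s.values.any()": best_time_us(lambda: s.values.any()),
    "s.all()": best_time_us(lambda: s.all()),
    "s.values.all()": best_time_us(lambda: s.values.all()),
}

for label, t in timings.items():
    print(f"{label:>16}: {t:8.2f} us")
```

Taking the minimum over repeats reduces noise from other processes; it does not remove any caching effects of the kind `%timeit` warns about.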

Expected Output

I would expect the performance to be comparable. Maybe not exactly the same, but not order(s) of magnitude slower.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.3
pytest: 3.3.0
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.1
scipy: 0.18.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 1.5.1
openpyxl: 2.5.6
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.9999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: 0.0.8
fastparquet: None
pandas_gbq: None
pandas_datareader: None


chris-b1 · 2019-04-09T20:01:46Z

You're welcome to have a look here - ultimately it will be due to the overhead of unpacking the array to deal with missing values, but it might be something that can be elided.

pandas/pandas/core/nanops.py

Line 204 in 2f6b90a

def _get_values(values, skipna, fill_value=None, fill_value_typ=None,
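To illustrate why the pandas reduction has to deal with missing values at all, here is a small sketch (written for this thread, not taken from `nanops.py`) showing where `Series.any` and the raw NumPy call can disagree. NumPy treats NaN as truthy, whereas pandas skips it by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan])

# pandas skips NaN by default, so only the 0.0 is considered.
pandas_skipna = bool(s.any(skipna=True))

# With skipna=False the NaN is kept and counts as truthy (it is not zero).
pandas_keepna = bool(s.any(skipna=False))

# NumPy has no notion of "missing": NaN is simply a non-zero float.
numpy_raw = bool(s.values.any())

print(pandas_skipna, pandas_keepna, numpy_raw)
```

For a plain bool or int array there is no NaN to skip, which is why the extra work is pure overhead in the cases benchmarked above.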

ericstarr · 2019-04-10T13:59:03Z

I spent a little time looking into this and don't see an obvious way to make it faster. Potentially, in the case where the dtype of the Series is bool and skipna is True, you could bypass the _get_values call and go straight to Series.values.any, but that seems like a messy way of dealing with this.
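As a user-side illustration of that bypass idea, here is a rough sketch. `any_fast` is a hypothetical helper name, and this is not how pandas implements or later fixed the reduction; it only shows the dtype check described above.

```python
import numpy as np
import pandas as pd


def any_fast(s: pd.Series, skipna: bool = True) -> bool:
    """Hypothetical helper: skip pandas' NaN handling for plain bool data."""
    if s.dtype == np.bool_:
        # A NumPy bool array cannot hold NaN, so skipna is irrelevant here
        # and the raw NumPy reduction gives the same answer.
        return bool(s.values.any())
    # Every other dtype goes through the regular pandas reduction.
    return bool(s.any(skipna=skipna))


print(any_fast(pd.Series([False, True])))
print(any_fast(pd.Series([0.0, np.nan])))
```

The second call falls through to `Series.any`, so NaN-aware behaviour is preserved for non-bool data.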

jreback · 2019-04-10T14:18:16Z

this is already being addressed in #25070

cc @qwhelan

qwhelan · 2019-04-10T15:50:16Z

I fixed the root cause in numpy with numpy/numpy#12988 so 1.17 should be about 250x faster in this regard.

arw2019 · 2020-09-24T03:27:39Z

Below are the results on 1.2 master. Booleans now outperform ints by ~4x for Series.any and by ~25x for Series.values.any.

In [7]: s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool) 

In [8]: %timeit s.any(skipna=True)                                                                     
16.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [9]: %timeit s.any(skipna=False) 
16.3 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit s.values.any() 
2.19 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: s2 = pd.Series(np.random.randint(0, 2, 100000)) 

In [12]: %timeit s2.any(skipna=True) 
71.6 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit s2.any(skipna=False) 
68.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: %timeit s2.values.any() 
53.4 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
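The ratios quoted above can be checked directly from the reported means. This sketch only redoes the arithmetic on the numbers in this comment; the dictionary layout and variable names are my own.

```python
# Mean timings in microseconds, copied from the %timeit output above.
timings_us = {
    ("bool", "Series.any"): 16.8,
    ("bool", "values.any"): 2.19,
    ("int", "Series.any"): 71.6,
    ("int", "values.any"): 53.4,
}

# How much faster bools are than ints, per method.
bool_vs_int_series = timings_us[("int", "Series.any")] / timings_us[("bool", "Series.any")]
bool_vs_int_values = timings_us[("int", "values.any")] / timings_us[("bool", "values.any")]

# Remaining pandas overhead relative to the raw NumPy call, per dtype.
overhead_bool = timings_us[("bool", "Series.any")] / timings_us[("bool", "values.any")]
overhead_int = timings_us[("int", "Series.any")] / timings_us[("int", "values.any")]

print(f"bool vs int, Series.any: {bool_vs_int_series:.1f}x")
print(f"bool vs int, values.any: {bool_vs_int_values:.1f}x")
print(f"Series.any vs values.any (bool): {overhead_bool:.1f}x")
print(f"Series.any vs values.any (int):  {overhead_int:.1f}x")
```

On these numbers Series.any is still several times slower than Series.values.any for bools, but the gap is measured in microseconds rather than milliseconds.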

chris-b1 added the Performance (Memory or execution speed performance) label Apr 9, 2019

chris-b1 added this to the Contributions Welcome milestone Apr 9, 2019

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

This was referenced Apr 1, 2023

PERF: Series.any #52341

Merged

PERF: Series.any/all #52381

Merged

mroeschke closed this as completed in #52381 Apr 4, 2023
