Series.all much slower than Series.values.all #26032
Comments
You're welcome to have a look here. Ultimately it will be due to the overhead of unpacking the array to deal with missing values, but there might be something that can be elided. (Line 204 in 2f6b90a)
I spent a little time looking into this and don't see an obvious way to make it faster. Potentially, in the case where the dtype of the Series is bool and skipna is True, you could bypass the _get_values call and go straight to df.values.any, but that seems like a messy way of dealing with this.
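To make the idea in the comment above concrete, here is a minimal sketch of what such a bool fast path could look like from user code. The helper name `fast_any` is hypothetical and not part of pandas; it only assumes that a NumPy bool array cannot hold missing values, so `skipna` makes no difference for that dtype. It does not reproduce the internal `_get_values` logic.

```python
import numpy as np
import pandas as pd


def fast_any(s: pd.Series, skipna: bool = True) -> bool:
    """Hypothetical helper (not a pandas API): skip NA handling for NumPy bool data."""
    if s.dtype == np.dtype(bool):
        # A NumPy bool array has no representation for NaN, so skipna is
        # irrelevant here and the reduction can run on the raw array.
        return bool(s.to_numpy().any())
    # Every other dtype goes through the regular pandas reduction.
    return bool(s.any(skipna=skipna))


print(fast_any(pd.Series([False, True, False])))
print(fast_any(pd.Series([0, 0, 0])))
```

This keeps the special case outside the library, which avoids the messiness the comment worries about, at the cost of callers having to opt in.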
I fixed the root cause in numpy with numpy/numpy#12988, so numpy 1.17 should be about 250x faster in this regard.
Below are the results on 1.2 master. In terms of the difference, booleans now outperform ints by ~4x for
In [7]: s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)
In [8]: %timeit s.any(skipna=True)
16.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit s.any(skipna=False)
16.3 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit s.values.any()
2.19 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: s2 = pd.Series(np.random.randint(0, 2, 100000))
In [12]: %timeit s2.any(skipna=True)
71.6 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [13]: %timeit s2.any(skipna=False)
68.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [14]: %timeit s2.values.any()
53.4 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
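The ratios implied by the timings above can be checked with a few lines of arithmetic. The numbers are the mean values from the session (skipna=True runs); the variable names are only for illustration.

```python
# Mean timings in microseconds, copied from the session above.
bool_series_any = 16.8   # s.any(skipna=True)
bool_values_any = 2.19   # s.values.any()
int_series_any = 71.6    # s2.any(skipna=True)
int_values_any = 53.4    # s2.values.any()

# How much slower the ints are than the bools through Series.any.
ints_vs_bools = int_series_any / bool_series_any
# Remaining Series.any overhead relative to the raw array, per dtype.
bool_overhead = bool_series_any / bool_values_any
int_overhead = int_series_any / int_values_any

print(f"ints vs bools (Series.any): {ints_vs_bools:.1f}x")     # 4.3x
print(f"bool Series.any vs values.any: {bool_overhead:.1f}x")  # 7.7x
print(f"int Series.any vs values.any: {int_overhead:.1f}x")    # 1.3x
```

So the ~4x figure matches the Series.any timings, while the bool path still carries roughly 7.7x overhead over the raw array call.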
Code Sample, a copy-pastable example if possible
Problem description
Calling Series.any is much slower than calling Series.values.any on a series of bools.
Interestingly, calling Series.any on a series of ints is quite a bit faster than on a series of bools, though even for a series of ints, Series.values.any is still faster.
I ran with both skipna=True and skipna=False in case it was an issue of how NaNs are being handled.
I see the same time differences with Series.all.
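The original code sample did not survive in this copy of the issue. The following is a sketch of a comparable benchmark, not the reporter's script: it assumes the same 100000-element random 0/1 data used in the later comment, and the `bench` helper and the `results` dictionary are names introduced here for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Random 0/1 data, once as bool and once as int (sizes are an assumption).
bool_series = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)
int_series = pd.Series(np.random.randint(0, 2, 100000))


def bench(fn, number=200, repeat=5):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


results = {
    "bool Series.any(skipna=True)": bench(lambda: bool_series.any(skipna=True)),
    "bool Series.any(skipna=False)": bench(lambda: bool_series.any(skipna=False)),
    "bool Series.values.any()": bench(lambda: bool_series.values.any()),
    "int Series.any(skipna=True)": bench(lambda: int_series.any(skipna=True)),
    "int Series.any(skipna=False)": bench(lambda: int_series.any(skipna=False)),
    "int Series.values.any()": bench(lambda: int_series.values.any()),
}

for label, seconds in results.items():
    print(f"{label}: {seconds * 1e6:.1f} µs")
```

Swapping `any` for `all` in each lambda covers the Series.all case mentioned above. Absolute numbers will depend heavily on the installed pandas and numpy versions.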
Expected Output
I would expect the performance to be comparable. Maybe not exactly the same, but not orders of magnitude slower.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.3
pytest: 3.3.0
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.1
scipy: 0.18.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 1.5.1
openpyxl: 2.5.6
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.9999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: 0.0.8
fastparquet: None
pandas_gbq: None
pandas_datareader: None