ENH: Add numba engine to several rolling aggregations #38895

mroeschke · 2021-01-02T02:06:51Z

tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Adds engine and engine_kwargs arguments to mean, median, sum, min, and max.

jreback

Do we have ASV benchmarks that exercise both engines? Can you show the results?

pandas/core/window/rolling.py

jreback · 2021-01-03T16:32:25Z

doc/source/whatsnew/v1.3.0.rst

@@ -51,6 +51,7 @@ Other enhancements
 - :func:`pandas.read_sql_query` now accepts a ``dtype`` argument to cast the columnar data from the SQL database based on user input (:issue:`10285`)
 - Improved integer type mapping from pandas to SQLAlchemy when using :meth:`DataFrame.to_sql` (:issue:`35076`)
 - :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`)
+- :meth:`.Rolling.sum`, :meth:`.Expanding.sum`, :meth:`.Rolling.mean`, :meth:`.Expanding.mean`, :meth:`.Rolling.median`, :meth:`.Expanding.median`, :meth:`.Rolling.max`, :meth:`.Expanding.max`, :meth:`.Rolling.min`, and :meth:`.Expanding.min` now support ``Numba`` execution with the ``engine`` keyword (:issue:`38895`)


might be worth a note in the user docs as well

mroeschke · 2021-01-04T07:16:36Z

Here's a simple benchmark

In [12]: N = 10_000_000

In [13]: df = pd.DataFrame((100 * np.random.random((N, 20))))

In [14]: roll_df = df.rolling(10)

In [15]: roll_df.mean(engine="numba", engine_kwargs={"parallel": True, "nogil": True})

Out[15]:
                0          1          2          3          4          5          6   ...         13         14         15         16         17         18         19
0              NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN
1              NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN
2              NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN
3              NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN
4              NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN
...            ...        ...        ...        ...        ...        ...        ...  ...        ...        ...        ...        ...        ...        ...        ...
9999995  53.109393  44.602854  44.822229  42.651840  52.167235  53.439720  34.879703  ...  49.400177  57.872117  30.206089  59.354336  38.903916  45.369952  36.430867
9999996  50.362627  49.842483  42.688050  36.641871  48.515543  53.232805  33.087431  ...  47.096377  59.415241  27.924127  62.007697  33.320770  41.012531  34.111801
9999997  52.901644  47.866224  42.234959  39.333112  53.802062  55.045552  36.034986  ...  48.406358  57.742796  27.194332  61.915400  33.759142  45.622714  35.858745
9999998  46.061426  53.639151  36.709441  44.440904  51.782057  48.010830  38.226252  ...  48.533150  62.545091  33.294572  65.169238  37.093948  44.830588  37.272684
9999999  40.147146  54.386060  43.861967  53.070330  53.403055  44.737140  42.638428  ...  49.509350  57.838767  33.224957  65.508094  35.911931  50.585150  36.818871

[10000000 rows x 20 columns]

In [16]: %timeit roll_df.mean()
7.69 s ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit roll_df.mean(engine="numba", engine_kwargs={"parallel": True, "nogil": True})
5.97 s ± 45.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

jreback · 2021-01-04T13:30:16Z

asv_bench/benchmarks/rolling.py

    )
-    param_names = ["constructor", "dtype", "function", "engine"]
+    param_names = ["constructor", "dtype", "function", "engine", "method"]


are these benchmarks still reasonable in terms of total time, e.g. < 1s per benchmark?

Each benchmark takes around 1s. Is that too much?

jreback · 2021-01-04T13:30:55Z

pandas/core/window/rolling.py

        nv.validate_window_func("sum", args, kwargs)
+        if maybe_use_numba(engine):
+            if self.method == "table":


might be worth a common function for this?

If I inline the args assignment, it would essentially be a one-line function.

Also, I anticipate numba adding an axis argument incrementally, so I would imagine some methods needing the if self.method == "table": check and others not.

jreback · 2021-01-04T13:31:59Z

pandas/tests/window/test_numba.py

+                engine_kwargs=engine_kwargs, engine="numba"
+            )
+
+        #  Once method='table' is supported, uncomment test below.


could just xfail this

jreback · 2021-01-04T13:32:06Z

pandas/tests/window/test_numba.py

+                engine_kwargs=engine_kwargs, engine="numba"
+            )
+
+        #  Once method='table' is supported, uncomment test below.


jreback · 2021-01-04T21:11:38Z

thanks

As discussed, you can write a function that iterates over the first dimension and calls the np.nan* functions.
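A minimal sketch of that suggestion in plain NumPy. The name nanmean_over_first_dim is hypothetical and does not come from the PR. The loop calls np.nanmean only on 1-D slices and never passes an axis argument, which is the restriction the commit history attributes to numba; whether the merged code looks like this is not shown in the thread.

```python
import numpy as np


def nanmean_over_first_dim(values):
    """Reduce each row of a 2-D array with np.nanmean, one 1-D slice at a time.

    Hypothetical helper for illustration only; no axis argument is used.
    """
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        out[i] = np.nanmean(values[i])
    return out


# A single 2-row, 3-column window with one missing value.
window = np.array([[1.0, np.nan, 3.0], [4.0, 5.0, 6.0]])

per_row = nanmean_over_first_dim(window)       # one result per row
per_column = nanmean_over_first_dim(window.T)  # transpose to reduce per column
```

Transposing the input before the call is one way to get a column-wise reduction out of a routine that only iterates over the first dimension.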

max-sixty · 2021-06-13T21:57:50Z

Hi @mroeschke — not sure this is the best forum, but saw this in the release notes and wanted to confirm whether my read of the code was correct.

I'm one of the developers of numbagg, and xarray (in which we use numbagg). (And in the olden days, I had occasional pandas contributions.)

IIUC, a rolling numba mean in pandas will call generate_numba_apply_func with func as np.mean. That will then calculate a window on each step and apply that function over the whole window at each step.

Is that correct? Do you have any thoughts on the relative efficiency of that vs. a "rolling algo"; i.e. one that keeps a running sum & count, adding one new value and subtracting one existing value at each step? I had thought that a rolling algo would be significantly faster — particularly for large windows — but I haven't tested it and perhaps you considered this already?

IIUC, the cython functions in pandas are rolling algos. And here's an example of that implemented with numba in numbagg: https://github.com/numbagg/numbagg/blob/v0.2.1/numbagg/moving.py#L85

Thanks in advance, and congrats on getting this into pandas!
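To make the two approaches in this comment concrete, here is a rough sketch in plain Python and NumPy. Both function names are made up for illustration, and the sketch assumes a fixed window with no missing values, so the count is always equal to the window size. It is not the pandas, Cython, or numbagg code.

```python
import numpy as np


def rolling_mean_per_window(values, window):
    """Recompute the mean of every full window from scratch: O(n * window)."""
    n = len(values)
    out = np.full(n, np.nan)
    for i in range(window - 1, n):
        out[i] = np.mean(values[i - window + 1 : i + 1])
    return out


def rolling_mean_running_sum(values, window):
    """Keep a running sum, adding one value and dropping one per step: O(n)."""
    n = len(values)
    out = np.full(n, np.nan)
    running = 0.0
    for i in range(n):
        running += values[i]               # value entering the window
        if i >= window:
            running -= values[i - window]  # value leaving the window
        if i >= window - 1:
            out[i] = running / window
    return out


values = np.arange(10.0)
per_window = rolling_mean_per_window(values, 3)
running = rolling_mean_running_sum(values, 3)
```

The first function does work proportional to the window size at every step, while the second does a constant amount of work per step, which is why the difference is expected to grow with the window size.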

mroeschke · 2021-06-14T17:06:20Z

Your understanding is correct @max-sixty; this implementation just calls np.mean over each window instead of tracking the sum and counts.

One perk of this implementation over the sum-and-counts method is that the for loop over the windows can be parallelized with numba, which can yield nice performance gains. I don't think the for loop that tracks the sum and counts can be parallelized and still yield correct results.

Without the parallelism, you're right that tracking sums and counts will be faster than the method I implemented here. I do still need to dig in a little more to see if these assumptions are correct.
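The independence argument can be sketched without numba. The example below uses a standard-library thread pool as a stand-in for numba's parallel loop, purely to show that each window's mean depends only on the input, so chunks of the output can be filled in any order. The function names and the n_chunks parameter are hypothetical, and this says nothing about actual speed.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def window_means(values, window, start, stop):
    """Means of the full windows ending at positions start..stop-1."""
    return [np.mean(values[i - window + 1 : i + 1]) for i in range(start, stop)]


def chunked_rolling_mean(values, window, n_chunks=4):
    """Fill the output in independent chunks; no chunk needs another's result."""
    n = len(values)
    out = np.full(n, np.nan)
    first = window - 1  # first position that has a full window
    if first >= n:
        return out
    bounds = np.linspace(first, n, n_chunks + 1).astype(int)
    chunks = list(zip(bounds[:-1], bounds[1:]))
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(window_means, values, window, lo, hi)
                   for lo, hi in chunks]
        for (lo, hi), future in zip(chunks, futures):
            out[lo:hi] = future.result()
    return out


data = np.random.default_rng(0).random(50)
chunked = chunked_rolling_mean(data, 5)
sequential = np.full(50, np.nan)
sequential[4:] = window_means(data, 5, 4, 50)
```

A running-sum loop, by contrast, carries state from one step to the next, so its iterations cannot be split up this simply.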

max-sixty · 2021-06-14T20:07:12Z

Thanks @mroeschke . And ofc another benefit is that the implementation is much more general, and can take user-jittable functions.

I'd be interested to know how benchmarks look if you do run them!

max-sixty · 2021-06-14T20:11:38Z

One other point — though very low confidence — it looks like you had some issues with supplying an axis parameter to numba functions. In numbagg we use gufuncs, which then operate on any number of dimensions, and the axis is handled externally to the numba routine. In case that's helpful.
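A loose NumPy-only illustration of handling the axis outside the core routine. This is not a numba gufunc and not numbagg's code; both function names are hypothetical. The core function only knows how to reduce the last axis, and a thin wrapper moves the requested axis into that position first.

```python
import numpy as np


def nanmean_last_axis(values):
    """Core routine: reduce only the last axis, one 1-D slice at a time."""
    flat = values.reshape(-1, values.shape[-1])
    out = np.empty(flat.shape[0])
    for i in range(flat.shape[0]):
        out[i] = np.nanmean(flat[i])
    return out.reshape(values.shape[:-1])


def nanmean_any_axis(values, axis=-1):
    """Wrapper: move the requested axis last, then call the core routine."""
    moved = np.moveaxis(np.asarray(values, dtype=float), axis, -1)
    return nanmean_last_axis(moved)


cube = np.random.default_rng(1).random((3, 4, 5))
cube[0, 1, 2] = np.nan
results = {axis: nanmean_any_axis(cube, axis=axis) for axis in range(3)}
```

With this split, the inner loop never needs an axis parameter, whatever the dimensionality of the input.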

mroeschke added 16 commits December 30, 2020 15:36

Add engine arguments to methods that support numpy nan methods

db2f86d

Route arguments through apply

8a7bbed

Merge remote-tracking branch 'upstream/master' into enh/rolling_table_aggs

61f6f89

Add docs

10dd6aa

Merge remote-tracking branch 'upstream/master' into enh/rolling_table_aggs

0707c28

Correct std and var signature. Add sem to fixture

65607fc

realized numba does not support axis or ddof arguments in np.nan* functions

2ff5fe2

Move median func below

eaee1cc

fix commented code

cce6181

Remove numba engine from quantile

7edd140

Remove other arguments from quantile

344485f

Add numba engine tests single method test

4ce2a78

Change to assert_series_equal

353fab6

Add whatsnew note

33e0552

Add PR number

a56fe92

Merge remote-tracking branch 'upstream/master' into enh/rolling_table_aggs

fee8e2e

mroeschke added this to the 1.3 milestone Jan 2, 2021

mroeschke added the Enhancement and Window (rolling, ewma, expanding) labels Jan 2, 2021

Remove redundant doc section

6bc1333

jreback requested changes Jan 3, 2021

jreback reviewed Jan 3, 2021

mroeschke added 5 commits January 3, 2021 18:21

Merge remote-tracking branch 'upstream/master' into enh/rolling_table_aggs

6bc330b

Merge remote-tracking branch 'upstream/master' into enh/rolling_table_aggs

76fc33f

Merge remote-tracking branch 'upstream/master' into enh/rolling_table_aggs

481bfd4

Add ASV benchmarks

178543b

Add engine arg

d8582dd

Add note in user_guide

49bae51

jreback reviewed Jan 4, 2021

mroeschke added 2 commits January 4, 2021 09:28

Merge remote-tracking branch 'upstream/master' into enh/rolling_table_aggs

f0e5e59

xfail instead of comment out

fc4656c

jreback approved these changes Jan 4, 2021

jreback merged commit df69f2a into pandas-dev:master Jan 4, 2021

mroeschke deleted the enh/rolling_table_aggs branch January 5, 2021 03:47

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

ENH: Add numba engine to several rolling aggregations (pandas-dev#38895)

0a74ca9

mroeschke mentioned this pull request Jul 26, 2021

POC: pandas shared kernels for mean | groupby mean | rolling mean twosigma/pandas#51

Closed
