ENH: min_weight in addition to min_periods for ewma #11167

Open
max-sixty opened this issue Sep 22, 2015 · 5 comments
Labels
Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Window rolling, ewma, expanding

Comments

@max-sixty
Contributor

Currently the exponential functions, such as pd.ewma, use a min_periods argument to ensure there's enough data to generate a valid value. While this works well for the rolling functions, it's not effective for exponential functions, because points keep some weight forever, albeit an ever-decreasing one:

In [4]: series=pd.Series(range(200))

In [5]: series[20:190]=pd.np.nan

In [6]: pd.ewma(series, span=10, min_periods=15)
Out[6]: 
0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
5             NaN
6             NaN
7             NaN
8             NaN
9             NaN
10            NaN
11            NaN
12            NaN
13            NaN
14      10.277660
15      11.172348
16      12.080052
17      12.999407
18      13.929141
19      14.868084
20      14.868084
21      14.868084
22      14.868084
23      14.868084
24      14.868084
25      14.868084
26      14.868084
27      14.868084
28      14.868084
29      14.868084
          ...    
170     14.868084
171     14.868084
172     14.868084
173     14.868084
174     14.868084
175     14.868084
176     14.868084
177     14.868084
178     14.868084
179     14.868084
180     14.868084
181     14.868084
182     14.868084
183     14.868084
184     14.868084
185     14.868084
186     14.868084
187     14.868084
188     14.868084
189     14.868084
190    190.000000
191    190.550000
192    191.132890
193    191.748020
194    192.394502
195    193.071240
196    193.776953
197    194.510212
198    195.269468
199    196.053089
dtype: float64
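
For readers on recent pandas releases, where `pd.ewma` and `pd.np` are no longer available, the session above can probably be reproduced with `Series.ewm`. This is a minimal sketch, assuming the defaults `adjust=True` and `ignore_na=False` match what `pd.ewma` used here; the variable name `result` is only for illustration.

```python
import numpy as np
import pandas as pd

# Same data as above: 200 points with a long gap in the middle.
series = pd.Series(np.arange(200, dtype=float))
series.iloc[20:190] = np.nan

# min_periods counts observations seen so far, so once 15 points
# have arrived it never masks anything again.
result = series.ewm(span=10, min_periods=15).mean()

# The last pre-gap value is carried across the whole gap.
print(result.iloc[[13, 14, 19, 20, 189, 190]])
```

The point of the sketch is that `min_periods` only guards the start of the series; it says nothing about how stale the data behind a value is.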

I think what we want is a min_weight argument: if you specify 0.5, a value is calculated only when non-missing observations carry 50% of the total weight. For rolling functions, where every point in the window has equal weight, this would be equivalent to min_periods being half of window.
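
The rolling analogy can be checked directly. The sketch below is only an illustration: `observed_weight`, `by_weight` and `by_periods` are made-up names, not pandas API, and it uses `>=` so that exactly half the window counts as sufficient.

```python
import numpy as np
import pandas as pd

window = 10
series = pd.Series(np.arange(30, dtype=float))
series.iloc[8:25] = np.nan

# In a rolling window every point has weight 1/window, so the observed
# share of weight is simply (non-missing count) / window.
observed = series.notna().astype(float)
observed_weight = observed.rolling(window, min_periods=1).sum() / window

# Mask by weight share ...
by_weight = series.rolling(window, min_periods=1).mean().where(observed_weight >= 0.5)
# ... versus the existing min_periods mechanism with half the window.
by_periods = series.rolling(window, min_periods=window // 2).mean()

pd.testing.assert_series_equal(by_weight, by_periods)
```

For exponential weights there is no finite window, which is why the same idea has to be expressed as a share of weight rather than a count of points.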

What are people's thoughts?

@jreback
Contributor

jreback commented Sep 24, 2015

what would you have the result look like given some min_weight?

@jreback jreback added the Numeric Operations Arithmetic, Comparison, and Logical operations label Sep 24, 2015
@max-sixty
Contributor Author

Here's a bit of test code that does what I think this should do:

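import pandas as pd

def filter_min_weights(series, min_weight):
    # Exponentially weighted mean with no min_periods restriction
    ewma_series = pd.ewma(series, span=10, min_periods=0)
    # 1.0 where an observation exists, 0.0 where it is missing
    has_weight = series.notnull().astype(float)
    # Share of the total exponential weight that falls on real observations
    weights = pd.ewma(has_weight, span=10, min_periods=0)
    # Keep values only where that share exceeds min_weight
    return ewma_series.where(weights > min_weight)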

Then:

In [128]: partial_series = pd.Series(range(200))

In [129]: partial_series[20:185]=pd.np.nan

In [130]: filter_min_weights(partial_series, min_weight=0.5)
Out[130]: 
0        0.000000
1        0.550000
2        1.132890
3        1.748020
4        2.394502
5        3.071240
6        3.776953
7        4.510212
8        5.269468
9        6.053089
10       6.859394
11       7.686679
12       8.533251
13       9.397448
14      10.277660
15      11.172348
16      12.080052
17      12.999407
18      13.929141
19      14.868084
20      14.868084
21      14.868084
22      14.868084
23            NaN
24            NaN
25            NaN
26            NaN
27            NaN
28            NaN
29            NaN
          ...    
170           NaN
171           NaN
172           NaN
173           NaN
174           NaN
175           NaN
176           NaN
177           NaN
178           NaN
179           NaN
180           NaN
181           NaN
182           NaN
183           NaN
184           NaN
185           NaN
186           NaN
187           NaN
188    186.748020
189    187.394502
190    188.071240
191    188.776953
192    189.510212
193    190.269468
194    191.053089
195    191.859394
196    192.686679
197    193.533251
198    194.397448
199    195.277660
dtype: float64

For clarity the weights here (*100) are:

Out[134]: 
0      100
1      100
2      100
3      100
4      100
5      100
6      100
7      100
8      100
9      100
10     100
11     100
12     100
13     100
14     100
15     100
16     100
17     100
18     100
19     100
20      81
21      66
22      54
23      44
24      36
25      29
26      24
27      19
28      16
29      13
      ... 
170      0
171      0
172      0
173      0
174      0
175      0
176      0
177      0
178      0
179      0
180      0
181      0
182      0
183      0
184      0
185     18
186     33
187     45
188     55
189     63
190     70
191     75
192     79
193     83
194     86
195     89
196     91
197     92
198     93
199     95
dtype: int64
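
The same weights and filtered result can be sketched on the current `Series.ewm` API. This assumes the defaults `adjust=True` and `ignore_na=False`; `observed_weight` is a hypothetical helper name, not something pandas provides.

```python
import numpy as np
import pandas as pd

def observed_weight(series, span):
    # Hypothetical helper: the EW mean of a 0/1 "has data" indicator is the
    # share of total exponential weight sitting on non-missing points.
    return series.notna().astype(float).ewm(span=span).mean()

partial_series = pd.Series(np.arange(200, dtype=float))
partial_series.iloc[20:185] = np.nan

weights = observed_weight(partial_series, span=10)
ewm_mean = partial_series.ewm(span=10).mean()

# Keep a value only where more than half of the weight is observed.
filtered = ewm_mean.where(weights > 0.5)
```

Because the indicator series has no missing values itself, its weights decay by absolute position, which is what makes the share drop during the gap.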

(apologies for the slow reply)

@max-sixty
Contributor Author

Anyone have any thoughts here? (do you know who the pandas experts on this stuff are @jreback?)

I'm fairly confident this is a better way, and - for once - this is my area of expertise outside of pandas. But I think it only makes sense to build this if there's some consensus (and after #11603).

Or let me know if my example is unclear / there are any Qs

@jreback
Contributor

jreback commented Nov 22, 2015

cc @seth-p

@seth-p
Contributor

seth-p commented Nov 25, 2015

I think this is a good idea, though probably makes sense only when ignore_na=False (the default). I guess you could implement it for ignore_na=True as well, though in that case the most it will do is produce NaN for an initial stretch of entries.

Obviously for backwards compatibility I would keep min_periods, and when both are specified produce non-NaN values only when both conditions are satisfied.
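
One way to read "both conditions" is sketched below, for the `ignore_na=False` case only. `ewm_mean_both` is a hypothetical name and `min_weight` is not an existing pandas argument; the sketch lets `min_periods` do its usual masking and then applies the weight test on top.

```python
import numpy as np
import pandas as pd

def ewm_mean_both(series, span, min_periods, min_weight):
    # Existing behaviour: NaN until min_periods observations have been seen.
    ewm_mean = series.ewm(span=span, min_periods=min_periods, ignore_na=False).mean()
    # Proposed extra condition: enough of the weight must be on real data.
    weight = series.notna().astype(float).ewm(span=span).mean()
    return ewm_mean.where(weight > min_weight)

s = pd.Series(np.arange(200, dtype=float))
s.iloc[20:185] = np.nan

out = ewm_mean_both(s, span=10, min_periods=15, min_weight=0.5)
```

On this data the start of the series is masked by `min_periods`, while the gap and the first few points after it are masked by the weight test.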

@jreback jreback added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 30, 2016
@jreback jreback added this to the Next Major Release milestone Aug 18, 2017
@jbrockmendel jbrockmendel added the Window rolling, ewma, expanding label Dec 18, 2019
@mroeschke mroeschke removed Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 8, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
5 participants