Methods min and max give NaN in time-aware rolling window even if min_periods=1 #15901

albertvillanova · 2017-04-05T13:25:40Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'a': [None, 2, 3]}, index=pd.to_datetime(['20170403', '20170404', '20170405']))

# sum skips the NaN as expected
df.rolling('3d', min_periods=1)['a'].sum()

# min and max return NaN for every row
df.rolling('3d', min_periods=1)['a'].min()
df.rolling('3d', min_periods=1)['a'].max()

Problem description

Even if we set min_periods=1, the functions min and max give NaN if there is one NaN value inside the time-aware rolling window.

However, there is no bug when the window width is fixed (not a time period):

In [397]: df.rolling(3, min_periods=1)['a'].min()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

Expected Output

The expected output, analogous to the one given by the function sum, should be a non-NaN value whenever there is at least one non-NaN value inside the rolling window.

In [397]: df.rolling('3d', min_periods=1)['a'].min()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

In [397]: df.rolling('3d', min_periods=1)['a'].max()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    3.0
Name: a, dtype: float64

Output of `pd.show_versions()`

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.29.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8.1
boto: 2.45.0
pandas_datareader: None


jreback · 2017-04-05T14:17:02Z

min_periods : int, default None
    Minimum number of observations in window required to have a value
    (otherwise result is NA). For a window that is specified by an offset,
    this will default to 1.

You are specifying a window by an offset, so what exactly would min_periods=1 mean?

It is essentially not implemented. I guess the docs could be better.

cc @chrisaycock

jreback · 2017-04-05T14:18:49Z

I think you actually want something like min_count (similar to #11167).

Or min_periods could take an offset (e.g. 1s), but again, what would that actually mean?

albertvillanova · 2017-04-05T15:11:45Z

The meaning of min_periods, independently of the type of window (either of fixed width indicated by an integer, or temporal width indicated by an offset), is the minimum number of non-NaN values that must exist inside the window in order to perform the function evaluation ignoring the other NaNs inside the window; otherwise, return NaN.

Note that min_periods works fine with an offset for the other functions, like sum:

In [403]: df.rolling('3d', min_periods=1)['a'].sum()
Out[403]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    5.0
Name: a, dtype: float64

In [404]: df.rolling('3d', min_periods=2)['a'].sum()
Out[404]:
2017-04-03    NaN
2017-04-04    NaN
2017-04-05    5.0
Name: a, dtype: float64

In [405]: df.rolling('3d', min_periods=3)['a'].sum()
Out[405]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64
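
For reference, the semantics described in this comment can be sketched as a slow, explicit loop. This is only an illustration: rolling_reduce_reference is a hypothetical helper (not part of pandas), and it assumes the offset window is open on the left and closed on the right. '3D' is used as the spelling of the '3d' offset from the examples above.

```python
import math

import pandas as pd


def rolling_reduce_reference(series, window, min_periods, func):
    """Hypothetical reference helper, not a pandas API.

    For each timestamp, collect the values in (stamp - window, stamp],
    drop the NaNs, and apply func only if at least min_periods
    non-NaN values remain; otherwise emit NaN.
    """
    offset = pd.Timedelta(window)
    results = []
    for stamp in series.index:
        # Assumed window bounds: open on the left, closed on the right.
        mask = (series.index > stamp - offset) & (series.index <= stamp)
        valid = series[mask].dropna()
        results.append(func(valid) if len(valid) >= min_periods else math.nan)
    return pd.Series(results, index=series.index, name=series.name, dtype=float)


s = pd.Series(
    [float('nan'), 2.0, 3.0],
    index=pd.to_datetime(['2017-04-03', '2017-04-04', '2017-04-05']),
    name='a',
)

ref_min = rolling_reduce_reference(s, '3D', 1, lambda v: v.min())
ref_max = rolling_reduce_reference(s, '3D', 1, lambda v: v.max())
ref_sum2 = rolling_reduce_reference(s, '3D', 2, lambda v: v.sum())
print(ref_min, ref_max, ref_sum2, sep='\n')
```

Under these assumptions, ref_sum2 agrees with the min_periods=2 sum output shown above, and ref_min / ref_max give what I would expect min and max to return.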

chrisaycock · 2017-04-05T15:27:34Z

I have some questions of my own. pandas by default excludes NaN and numpy includes it:

In [35]: df.a.min()
Out[35]: 2.0

In [36]: df.a.values.min()
Out[36]: nan

But then for some reason, calling numpy as a stand-alone function excludes the NaN, which seems to contradict their docs:

In [37]: np.min(df.a)
Out[37]: 2.0

And if I try their version that explicitly excludes NaN, I get back a Series instead of a scalar!

In [38]: np.nanmin(df.a)
Out[38]:
2017-04-03    2.0
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

So it seems there are lots of unexpected results here.

albertvillanova · 2017-04-05T15:54:30Z

@chrisaycock Concerning your first question,

Forgetting offsets for a moment, why does min_periods cause this to have a different value?

In [23]: df.rolling(3)['a'].min()
Out[23]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [24]: df.rolling(3, min_periods=1)['a'].min()
Out[24]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

for a fixed width rolling window (specified by an integer), the default value for the parameter min_periods is the width of the window.

These are equivalent:

In [406]: df.rolling(3)['a'].min()
Out[406]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [407]: df.rolling(3, min_periods=3)['a'].min()
Out[407]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64
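
A small self-check of the two defaults, in case it helps: window width for an integer window (as stated in this comment) and 1 for an offset window (per the docstring quoted earlier in the thread). It uses sum, which nobody here disputes, and '3D' as the spelling of '3d'. The variable names are just for illustration.

```python
import pandas as pd

s = pd.Series(
    [float('nan'), 2.0, 3.0],
    index=pd.to_datetime(['2017-04-03', '2017-04-04', '2017-04-05']),
    name='a',
)

# Integer window: omitting min_periods behaves like min_periods=window.
fixed_default = s.rolling(3).sum()
fixed_explicit = s.rolling(3, min_periods=3).sum()
pd.testing.assert_series_equal(fixed_default, fixed_explicit)

# Offset window: omitting min_periods behaves like min_periods=1.
offset_default = s.rolling('3D').sum()
offset_explicit = s.rolling('3D', min_periods=1).sum()
pd.testing.assert_series_equal(offset_default, offset_explicit)
```

Both assertions pass silently when the defaults behave as described.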

albertvillanova · 2017-04-05T22:08:33Z

@chrisaycock For the other questions, you are passing a Pandas Series as an argument to Numpy functions, which expect an array or an ndarray.

If you use the Pandas Series attribute .values, you get a Numpy ndarray and Numpy functions give the expected results:

In [23]: np.min(df.a.values)
Out[23]: nan

In [24]: np.nanmin(df.a.values)
Out[24]: 2.0

Nevertheless, I think this is a digression with respect to the original issue: Pandas min and max functions (contrary to sum and others) do not give the expected output when there is a NaN within a time-aware (specified by a time offset) rolling window.

jreback · 2017-04-05T22:13:04Z

Oh, so this works for the other numeric functions, just not for min/max with an offset?

If that is the case, it is a bug.

albertvillanova · 2017-04-06T05:39:05Z

@jreback This is the output for other functions:

In [30]: df.rolling('3d')['a'].sum()
Out[30]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    5.0
Name: a, dtype: float64

In [31]: df.rolling('3d')['a'].count()
Out[31]: 
2017-04-03    NaN
2017-04-04    1.0
2017-04-05    2.0
Name: a, dtype: float64

In [32]: df.rolling('3d')['a'].mean()
Out[32]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.5
Name: a, dtype: float64

In [33]: df.rolling('3d')['a'].median()
Out[33]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.5
Name: a, dtype: float64

whereas this is the output for the functions min and max:

In [34]: df.rolling('3d')['a'].min()
Out[34]: 
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [35]: df.rolling('3d')['a'].max()
Out[35]: 
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

jorisvandenbossche · 2017-04-06T07:18:47Z

It is maybe not due to the min_periods, but rather the min/max function implementation, as doing it with an apply, you get the expected result:

In [14]: df.rolling('3d', min_periods=1)['a'].apply(lambda x: np.nanmin(x))
Out[14]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

But I agree with @albertvillanova, this is certainly a bug.

@chrisaycock

But then for some reason, calling numpy as a stand-alone function excludes the NaN

That is because if you do np.min(series), under the hood it will check if the series object has a min method, and use that. So that actually uses series.min(), hence the confusing result if you expected numpy nan semantics.
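
The delegation can be seen without pandas at all. Below is a minimal sketch with a made-up class: HasMin is hypothetical and only stands in for a Series-like object whose min method skips NaN.

```python
import numpy as np


class HasMin:
    """Hypothetical stand-in for a Series-like object; not a pandas class."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)

    def min(self, axis=None, out=None, **kwargs):
        # Mimic the NaN-skipping semantics of Series.min().
        return np.nanmin(self.values)


obj = HasMin([np.nan, 2.0, 3.0])

delegated = np.min(obj)      # obj is not an ndarray, so np.min calls obj.min(...)
plain = np.min(obj.values)   # plain ndarray: the NaN propagates

print(delegated, plain)
```

So the result of np.min depends on what the object's own min method does, not on numpy's NaN handling.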

chrisaycock · 2017-04-06T14:15:40Z

My point was the inconsistent nanmin, which apparently has been reported before as #8383 and numpy/numpy#5114.

Regarding this particular issue, yes, min/max should have the same behavior as sum. The min_periods is a red herring.

jreback · 2017-04-06T15:11:43Z

ok if someone wants to take a crack at this, have at it.

zhangzhengxin · 2018-09-18T13:38:38Z

Was it fixed?

ihsansecer · 2019-06-22T21:31:44Z

@jreback this is working fine in 0.24.x

jreback · 2019-06-22T21:35:28Z

@ihsansecer how is this on master?

If it is working, do we have a test for this?

ihsansecer · 2019-06-22T22:19:57Z

@jreback working fine too. min_periods is tested for min and max but it doesn't address this issue.

correction: it is tested here
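
For anyone landing here later, a regression check for this exact report could look roughly like the sketch below. It is not the test from the pandas suite; the function name is made up, and it assumes a pandas version where the behaviour is fixed ('3D' is the same offset as the '3d' in the original report).

```python
import numpy as np
import pandas as pd


def test_rolling_offset_min_max_skip_nan():
    # Hypothetical test name; the data is the frame from the original report.
    df = pd.DataFrame(
        {'a': [np.nan, 2.0, 3.0]},
        index=pd.to_datetime(['2017-04-03', '2017-04-04', '2017-04-05']),
    )
    rolled = df.rolling('3D', min_periods=1)['a']

    expected_min = pd.Series([np.nan, 2.0, 2.0], index=df.index, name='a')
    expected_max = pd.Series([np.nan, 2.0, 3.0], index=df.index, name='a')

    pd.testing.assert_series_equal(rolled.min(), expected_min)
    pd.testing.assert_series_equal(rolled.max(), expected_max)


test_rolling_offset_min_max_skip_nan()
```

The expected values are the ones given under "Expected Output" in the original report.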

jreback · 2019-06-22T22:54:24Z

thanks for checking @ihsansecer

jorisvandenbossche added the Bug label Apr 6, 2017

WillAyd added the Window rolling, ewma, expanding label Oct 5, 2018

jreback closed this as completed Jun 22, 2019
