Methods min and max give NaN in time-aware rolling window even if min_periods=1 #15901

Closed
albertvillanova opened this issue Apr 5, 2017 · 16 comments
Labels
Bug · Datetime (Datetime data dtype) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Window (rolling, ewma, expanding)

Comments

@albertvillanova
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'a': [None, 2, 3]}, index=pd.to_datetime(['20170403', '20170404', '20170405']))

df.rolling('3d', min_periods=1)['a'].sum()

df.rolling('3d', min_periods=1)['a'].min()
df.rolling('3d', min_periods=1)['a'].max()

Problem description

Even if we set min_periods=1, the functions min and max give NaN if there is one NaN value inside the time-aware rolling window.

However, there is no bug when the window width is fixed (not a time period):

In [397]: df.rolling(3, min_periods=1)['a'].min()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

Expected Output

The expected output, analogous to that of sum, should be a non-NaN value whenever there is at least one non-NaN value inside the rolling window.

In [397]: df.rolling('3d', min_periods=1)['a'].min()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

In [397]: df.rolling('3d', min_periods=1)['a'].max()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    3.0
Name: a, dtype: float64

Output of pd.show_versions()

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.29.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8.1
boto: 2.45.0
pandas_datareader: None

@jreback
Contributor

jreback commented Apr 5, 2017

min_periods : int, default None
    Minimum number of observations in window required to have a value
    (otherwise result is NA). For a window that is specified by an offset,
    this will default to 1.

you are specifying a window by an offset. So what exactly would min_periods=1 actually mean?

It is essentially not implemented. I guess the docs could be better.

cc @chrisaycock

@jreback
Contributor

jreback commented Apr 5, 2017

I think you actually want something like min_count (similar to #11167).

or min_periods could actually take an offset. (e.g. 1s), but again what would that actually mean?

@albertvillanova
Contributor Author

The meaning of min_periods, independently of the type of window (either of fixed width indicated by an integer, or temporal width indicated by an offset), is the minimum number of non-NaN values that must exist inside the window in order to perform the function evaluation ignoring the other NaNs inside the window; otherwise, return NaN.

Note that min_periods works fine with an offset for the other functions, like sum:

In [403]: df.rolling('3d', min_periods=1)['a'].sum()
Out[403]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    5.0
Name: a, dtype: float64

In [404]: df.rolling('3d', min_periods=2)['a'].sum()
Out[404]:
2017-04-03    NaN
2017-04-04    NaN
2017-04-05    5.0
Name: a, dtype: float64

In [405]: df.rolling('3d', min_periods=3)['a'].sum()
Out[405]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64
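
The meaning of min_periods described in this comment can be written down as a small reference implementation. This is only a sketch in plain Python, not how pandas implements rolling windows. The function name `rolling_min_reference` is hypothetical, and the window is assumed to be the right-closed interval (t - window, t], which is the default for offset windows in pandas.

```python
import math
from datetime import datetime, timedelta


def rolling_min_reference(times, values, window, min_periods=1):
    """Time-aware rolling min that ignores NaN (hypothetical helper)."""
    out = []
    for t in times:
        # Observations whose timestamp falls in the window (t - window, t].
        in_window = [v for s, v in zip(times, values) if t - window < s <= t]
        # Only non-NaN observations count towards min_periods.
        valid = [v for v in in_window if not math.isnan(v)]
        out.append(min(valid) if len(valid) >= min_periods else math.nan)
    return out


times = [datetime(2017, 4, 3), datetime(2017, 4, 4), datetime(2017, 4, 5)]
values = [math.nan, 2.0, 3.0]

result = rolling_min_reference(times, values, timedelta(days=3))
result_two = rolling_min_reference(times, values, timedelta(days=3), min_periods=2)
```

With min_periods=1 the first window holds no valid observation and stays NaN, while the other two yield 2.0. With min_periods=2 only the last window has enough valid observations.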

@chrisaycock
Contributor

chrisaycock commented Apr 5, 2017

I have some questions of my own. pandas by default excludes NaN and numpy includes it:

In [35]: df.a.min()
Out[35]: 2.0

In [36]: df.a.values.min()
Out[36]: nan

But then for some reason, calling numpy as a stand-alone function excludes the NaN, which seems to contradict their docs:

In [37]: np.min(df.a)
Out[37]: 2.0

And if I try their version that explicitly excludes NaN, I get back a Series instead of a scalar!

In [38]: np.nanmin(df.a)
Out[38]:
2017-04-03    2.0
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

So it seems there are lots of unexpected results here.

@albertvillanova
Contributor Author

albertvillanova commented Apr 5, 2017

@chrisaycock Concerning your first question,

Forgetting offsets for a moment, why does min_periods cause this to have a different value?

In [23]: df.rolling(3)['a'].min()
Out[23]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [24]: df.rolling(3, min_periods=1)['a'].min()
Out[24]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

for a fixed width rolling window (specified by an integer), the default value for the parameter min_periods is the width of the window.

These are equivalent:

In [406]: df.rolling(3)['a'].min()
Out[406]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [407]: df.rolling(3, min_periods=3)['a'].min()
Out[407]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64
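
The two defaults mentioned in this thread (window size for an integer window, 1 for an offset window) can be checked programmatically. A minimal sketch, assuming a pandas version with offset-based rolling windows; the variable names are illustrative, and "3D" is used as the spelling of the same 3-day offset the thread writes as '3d'.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [None, 2, 3]},
    index=pd.to_datetime(["20170403", "20170404", "20170405"]),
)

# Integer window: omitting min_periods behaves like min_periods=<window size>.
fixed_default = df.rolling(3)["a"].min()
fixed_explicit = df.rolling(3, min_periods=3)["a"].min()

# Offset window: omitting min_periods behaves like min_periods=1.
offset_default = df.rolling("3D")["a"].sum()
offset_explicit = df.rolling("3D", min_periods=1)["a"].sum()

# Series.equals treats NaNs in matching positions as equal.
same_fixed = fixed_default.equals(fixed_explicit)
same_offset = offset_default.equals(offset_explicit)
```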

@albertvillanova
Contributor Author

albertvillanova commented Apr 5, 2017

@chrisaycock For the other questions, you are passing a Pandas Series as an argument to Numpy functions, which expect an array or an ndarray.

If you use the Pandas Series attribute .values, you get a Numpy ndarray and Numpy functions give the expected results:

In [23]: np.min(df.a.values)
Out[23]: nan

In [24]: np.nanmin(df.a.values)
Out[24]: 2.0

Nevertheless, I think this is a digression with respect to the original issue: Pandas min and max functions (contrary to sum and others) do not give the expected output when there is a NaN within a time-aware (specified by a time offset) rolling window.
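
For reference, the NumPy behaviour on a plain ndarray can be reproduced without a DataFrame. This is a small sketch; the Series `s` is a stand-in for `df.a` from the report.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, 3.0])
arr = s.to_numpy()  # plain ndarray, like .values for this float Series

plain_min = np.min(arr)   # NaN propagates through the reduction
nan_min = np.nanmin(arr)  # NaN values are skipped
```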

@jreback
Contributor

jreback commented Apr 5, 2017

oh, so this works for the other numeric functions, just not for min and max with an offset?

if that is the case it is a bug

@albertvillanova
Contributor Author

@jreback This is the output for other functions:

In [30]: df.rolling('3d')['a'].sum()
Out[30]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    5.0
Name: a, dtype: float64

In [31]: df.rolling('3d')['a'].count()
Out[31]: 
2017-04-03    NaN
2017-04-04    1.0
2017-04-05    2.0
Name: a, dtype: float64

In [32]: df.rolling('3d')['a'].mean()
Out[32]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.5
Name: a, dtype: float64

In [33]: df.rolling('3d')['a'].median()
Out[33]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.5
Name: a, dtype: float64

whereas this is the output for the functions min and max:

In [34]: df.rolling('3d')['a'].min()
Out[34]: 
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [35]: df.rolling('3d')['a'].max()
Out[35]: 
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64
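
One way to spot which aggregations are affected is to compare the offset window against the equivalent integer window. A sketch under one assumption: because the index is daily, a 3-day offset window covers the same rows as a 3-row window with min_periods=1. Going by the outputs in this thread, min and max would be expected to compare unequal on 0.19.2; on a version where the behaviour is fixed, every entry should be True.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [None, 2, 3]},
    index=pd.to_datetime(["20170403", "20170404", "20170405"]),
)

results = {}
for name in ["sum", "mean", "median", "min", "max"]:
    # Same aggregation, once over a 3-day offset window and once over 3 rows.
    by_offset = getattr(df.rolling("3D", min_periods=1)["a"], name)()
    by_count = getattr(df.rolling(3, min_periods=1)["a"], name)()
    results[name] = by_offset.equals(by_count)

print(results)
```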

@jorisvandenbossche
Member

It is maybe not due to the min_periods, but rather the min/max function implementation, as doing it with an apply, you get the expected result:

In [14]: df.rolling('3d', min_periods=1)['a'].apply(lambda x: np.nanmin(x))
Out[14]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

But I agree with @albertvillanova, this is certainly a bug.
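
The apply-based workaround extends to max in the same way. A minimal sketch, assuming a pandas version where `apply` accepts the `raw` keyword; passing `np.nanmin` and `np.nanmax` directly is equivalent to the lambda above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [None, 2, 3]},
    index=pd.to_datetime(["20170403", "20170404", "20170405"]),
)

roller = df.rolling("3D", min_periods=1)["a"]

# raw=True hands each window to the function as a NumPy array.
workaround_min = roller.apply(np.nanmin, raw=True)
workaround_max = roller.apply(np.nanmax, raw=True)
```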

@chrisaycock

But then for some reason, calling numpy as a stand-alone function excludes the NaN

That is because if you do np.min(series), under the hood it will check if the series object has a min method, and use that. So that actually uses series.min(), hence the confusing result if you expected numpy nan semantics.
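
A short sketch of that consequence: `np.min` on a Series gives the same answer as `Series.min()`, and NaN propagation has to be requested explicitly through `skipna=False`.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, 3.0])

via_numpy = np.min(s)         # handled by Series.min(), which skips NaN by default
via_pandas = s.min()
strict = s.min(skipna=False)  # opt in to NaN propagation
```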

@chrisaycock
Contributor

My point was the inconsistent nanmin, which apparently has been reported before as #8383 and numpy/numpy#5114.

Regarding this particular issue, yes, min/max should have the same behavior as sum. The min_periods is a red herring.

@jreback
Contributor

jreback commented Apr 6, 2017

ok if someone wants to take a crack at this, have at it.

@zhangzhengxin

Was it fixed?

WillAyd added the Window (rolling, ewma, expanding) label on Oct 5, 2018
@ihsansecer
Contributor

@jreback this is working fine in 0.24.x

@jreback
Contributor

jreback commented Jun 22, 2019

@ihsansecer how is this on master?

if it is working, do we have a test for this?

@ihsansecer
Contributor

ihsansecer commented Jun 22, 2019

@jreback working fine too. min_periods is tested for min and max but it doesn't address this issue.

correction: it is tested here
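
For illustration, a regression test for this exact report could look like the sketch below. This is not the test referenced above; the test name is hypothetical and the expected values are the ones listed under "Expected Output" in the original report.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_offset_rolling_min_max_skips_nan():
    index = pd.to_datetime(["20170403", "20170404", "20170405"])
    df = pd.DataFrame({"a": [None, 2, 3]}, index=index)
    roller = df.rolling("3D", min_periods=1)["a"]

    # NaN only where the window has no valid observation at all.
    expected_min = pd.Series([np.nan, 2.0, 2.0], index=index, name="a")
    expected_max = pd.Series([np.nan, 2.0, 3.0], index=index, name="a")

    tm.assert_series_equal(roller.min(), expected_min)
    tm.assert_series_equal(roller.max(), expected_max)


test_offset_rolling_min_max_skips_nan()
```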

@jreback
Contributor

jreback commented Jun 22, 2019

thanks for checking @ihsansecer

jreback closed this as completed on Jun 22, 2019

7 participants