Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix overflow bugs in date_Range #24255

Merged
merged 12 commits into from
Dec 23, 2018
Merged

Conversation

jbrockmendel
Copy link
Member

Fixing this turned out to be a small beast. The overflows of interest fall into three categories. First: overflows when computing periods * stride that ATM are not caught, so give incorrect results:

>>> pd.date_range('1969-05-04', periods=200000000, freq='30000D')
DatetimeIndex(['1969-05-04', '2051-06-23', '2133-08-12'], dtype='datetime64[ns]', freq='30000D')

>>> dti = pd.date_range(start='1677-09-22', end='2262-04-11', freq='D')
>>> pd.date_range(start=dti[0], periods=len(dti), freq='D')
DatetimeIndex([], dtype='datetime64[ns]', freq='D')

Second, overflows when periods * stride overflow the int64 bounds but not uint64 bounds, which in some cases are recoverable. The following should not raise:

>>> dti = pd.date_range(start='1677-09-22', end='2262-04-11', freq='D')
>>> pd.date_range(start=dti[0], periods=len(dti)-1, freq='D')
[...]
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Cannot generate range with start=-9223286400000000000 and periods=213502

Last is a similar one with a negative stride where we can recover, but not by converting to uint64:

>>> dti = pd.date_range(start='1677-09-22', end='2262-04-11', freq='D')
>>> pd.date_range(end=dti[-1], periods=len(dti), freq='D')
[...]
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Cannot generate range with end=9223372800000000000 and periods=213503

@pep8speaks
Copy link

pep8speaks commented Dec 12, 2018

Hello @jbrockmendel! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 23, 2018 at 16:48 Hours UTC

@codecov
Copy link

codecov bot commented Dec 12, 2018

Codecov Report

Merging #24255 into master will decrease coverage by <.01%.
The diff coverage is 92.3%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24255      +/-   ##
==========================================
- Coverage   92.21%   92.21%   -0.01%     
==========================================
  Files         162      162              
  Lines       51768    51806      +38     
==========================================
+ Hits        47739    47773      +34     
- Misses       4029     4033       +4
Flag Coverage Δ
#multiple 90.61% <92.3%> (-0.01%) ⬇️
#single 42.98% <38.46%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/datetimes.py 97.98% <92.3%> (-0.58%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9bc42f9...c6256de. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 12, 2018

Codecov Report

Merging #24255 into master will increase coverage by <.01%.
The diff coverage is 98.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24255      +/-   ##
==========================================
+ Coverage   92.29%    92.3%   +<.01%     
==========================================
  Files         162      163       +1     
  Lines       51890    51932      +42     
==========================================
+ Hits        47892    47934      +42     
  Misses       3998     3998
Flag Coverage Δ
#multiple 90.71% <98.73%> (ø) ⬆️
#single 42.98% <55.69%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/datetimes.py 97.77% <100%> (+0.04%) ⬆️
pandas/core/arrays/_ranges.py 98.66% <98.66%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7bf6f2...43d3d83. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • why are you not using algos.checked_add_with_arr?
  • I am -1 on using uint64 here at all. why add all of this complexity, simply raise?

@jbrockmendel
Copy link
Member Author

why are you not using algos.checked_add_with_arr?

We are in one place, but it is mostly unnecessary, adds a lot of machinery for NA-masking that isn't relevant here.

I am -1 on using uint64 here at all. why add all of this complexity, simply raise?

Why would we want to raise on valid inputs? It's entirely possible that there is a simpler way to do this; I banged my head against it for a while before I got it working, so by the end elegance was not a priority.

These overflows were found when debugging the pickle problems in #24074. There is a test that creates a DTI (something like) dti = pd.date_range(start='1677-09-22', end='2262-04-11', freq='D'), pickles it, then tries to unpickle it. In the unpickling, it calls validate_frequency, which calls cls._generate_range(start=dti[0], periods=len(dti), freq=dti.freq), which in master currently returns an incorrect result because of a silent overflow in multiplication.

@jreback
Copy link
Contributor

jreback commented Dec 13, 2018

Why would we want to raise on valid inputs? It's entirely possible that there is a simpler way to do this; I banged my head against it for a while before I got it working, so by the end elegance was not a priority

how are these valid inputs iif you are overflowing int64? I would rather not jump thru hoops and have over complex code.

@jreback jreback added Bug Datetime Datetime data dtype labels Dec 13, 2018
@jbrockmendel
Copy link
Member Author

how are these valid inputs iif you are overflowing int64?

Because the biggest number we can get by adding an int64 to np.iinfo(np.int64).min is -1, but we can have date_ranges that start near the lower implementation bound and proceed well past 1970-01-01.

I claim that for any DTI produced with dti = pd.date_range(start=start, end=end, freq=freq), we should be able to write dti2 = pd.date_range(start=dti[0], periods=len(dti), freq=dti.freq) and have assert dti2.equals(dti). Ditto for dti3 = pd.date_range('end=dti[-1], periods=len(dti), freq=dti.freq).

@jbrockmendel
Copy link
Member Author

Travis fail is Hypothesis

@@ -1747,7 +1747,8 @@ def _generate_regular_range(cls, start, end, periods, freq):
return data


def _generate_range_overflow_safe(endpoint, periods, stride, side='start'):
def _generate_range_overflow_safe(endpoint, periods, stride,
side='start'):
"""
Calculate the second endpoint for passing to np.arange, checking
to avoid an integer overflow. Catch OverflowError and re-raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you doc-string this a bit more (say what the parameters mean)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

periods : int
stride : int
side : {'start', 'end'}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so pls move all of this new code to the same place as checked_add_with_arr

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

algos seems like a weird place to put datetime-specific code, no? I guess in principle this code could be used for generating int64 ranges...

@@ -88,6 +88,33 @@ def test_date_range_nat(self):
with pytest.raises(ValueError, match=msg):
date_range(start=pd.NaT, end='2016-01-01', freq='D')

def test_date_range_multiplication_overflow(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so 2 test cases fully exercise all of this code???

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I'm tracking down untested cases now.

if result <= i64max + np.uint64(stride):
return result
else:
return _generate_range_recurse(endpoint, periods,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are these 2 cases?

why is this recursion? why can't you do this iteratively?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this recursion? why can't you do this iteratively?

generate_range_recurse breaks the problem down into two smaller pieces and calls _generate_range_overflow_safe twice. It is iterative in the sense that it makes two calls, as opposed to repeated calls to itself. It is recursive inasmuch as it is called via _generate_range_overflow_safe, which it also calls. I'm open to other ideas for names.

what are these 2 cases?

Which two cases are you asking about?

@jbrockmendel
Copy link
Member Author

Updated with a couple of tests for previously un-hit cases, removed some code that turned out to be un-reachable, fleshed out docstring params.

It turns out the call to np.arange can overflow. Now catch that too (with test)

@jbrockmendel jbrockmendel mentioned this pull request Dec 19, 2018
12 tasks
@jbrockmendel
Copy link
Member Author

gentle ping. I think this will fix the inability to do frequency validation discussed in #24024.

"if a 'period' is given.")

with np.errstate(over="raise"):
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on why you need this

return _generate_range_overflow_safe_signed(
endpoint, periods, stride, side)

elif ((endpoint > 0 and side == 'start' and stride > 0) or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this case already hendled in _generate_range_overflow_safe_signed? l169

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The check on 169 is specific to the case where periods * stride doesn't overflow int64 bounds. The check here on 127 is for the case where it does.

That said, this check could be removed. It is a fast-path to fail early.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k make sure that all of these paths are hit and tested - make as simple as possible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double-checked: every branch in here is hit in the tests

@jreback jreback added this to the 0.24.0 milestone Dec 23, 2018
@jreback jreback merged commit 1cd077a into pandas-dev:master Dec 23, 2018
@jreback
Copy link
Contributor

jreback commented Dec 23, 2018

thanks

@jbrockmendel jbrockmendel deleted the over branch December 23, 2018 17:33
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: un-caught date_range overflows
4 participants