-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix overflow bugs in date_Range #24255
Conversation
Hello @jbrockmendel! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 23, 2018 at 16:48 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #24255 +/- ##
==========================================
- Coverage 92.21% 92.21% -0.01%
==========================================
Files 162 162
Lines 51768 51806 +38
==========================================
+ Hits 47739 47773 +34
- Misses 4029 4033 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24255 +/- ##
==========================================
+ Coverage 92.29% 92.3% +<.01%
==========================================
Files 162 163 +1
Lines 51890 51932 +42
==========================================
+ Hits 47892 47934 +42
Misses 3998 3998
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- why are you not using
algos.checked_add_with_arr
? - I am -1 on using uint64 here at all. why add all of this complexity, simply raise?
We are in one place, but it is mostly unnecessary, adds a lot of machinery for NA-masking that isn't relevant here.
Why would we want to raise on valid inputs? It's entirely possible that there is a simpler way to do this; I banged my head against it for a while before I got it working, so by the end elegance was not a priority. These overflows were found when debugging the pickle problems in #24074. There is a test that creates a DTI (something like) |
how are these valid inputs iif you are overflowing int64? I would rather not jump thru hoops and have over complex code. |
Because the biggest number we can get by adding an int64 to I claim that for any DTI produced with |
Travis fail is Hypothesis |
pandas/core/arrays/datetimes.py
Outdated
@@ -1747,7 +1747,8 @@ def _generate_regular_range(cls, start, end, periods, freq): | |||
return data | |||
|
|||
|
|||
def _generate_range_overflow_safe(endpoint, periods, stride, side='start'): | |||
def _generate_range_overflow_safe(endpoint, periods, stride, | |||
side='start'): | |||
""" | |||
Calculate the second endpoint for passing to np.arange, checking | |||
to avoid an integer overflow. Catch OverflowError and re-raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you doc-string this a bit more (say what the parameters mean)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
pandas/core/arrays/datetimes.py
Outdated
periods : int | ||
stride : int | ||
side : {'start', 'end'} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so pls move all of this new code to the same place as checked_add_with_arr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
algos seems like a weird place to put datetime-specific code, no? I guess in principle this code could be used for generating int64 ranges...
@@ -88,6 +88,33 @@ def test_date_range_nat(self): | |||
with pytest.raises(ValueError, match=msg): | |||
date_range(start=pd.NaT, end='2016-01-01', freq='D') | |||
|
|||
def test_date_range_multiplication_overflow(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so 2 test cases fully exercise all of this code???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I'm tracking down untested cases now.
pandas/core/arrays/datetimes.py
Outdated
if result <= i64max + np.uint64(stride): | ||
return result | ||
else: | ||
return _generate_range_recurse(endpoint, periods, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are these 2 cases?
why is this recursion? why can't you do this iteratively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this recursion? why can't you do this iteratively?
generate_range_recurse
breaks the problem down into two smaller pieces and calls _generate_range_overflow_safe
twice. It is iterative in the sense that it makes two calls, as opposed to repeated calls to itself. It is recursive inasmuch as it is called via _generate_range_overflow_safe
, which it also calls. I'm open to other ideas for names.
what are these 2 cases?
Which two cases are you asking about?
Updated with a couple of tests for previously un-hit cases, removed some code that turned out to be un-reachable, fleshed out docstring params. It turns out the call to |
gentle ping. I think this will fix the inability to do frequency validation discussed in #24024. |
"if a 'period' is given.") | ||
|
||
with np.errstate(over="raise"): | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on why you need this
return _generate_range_overflow_safe_signed( | ||
endpoint, periods, stride, side) | ||
|
||
elif ((endpoint > 0 and side == 'start' and stride > 0) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this case already hendled in _generate_range_overflow_safe_signed? l169
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The check on 169 is specific to the case where periods * stride doesn't overflow int64 bounds. The check here on 127 is for the case where it does.
That said, this check could be removed. It is a fast-path to fail early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k make sure that all of these paths are hit and tested - make as simple as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just double-checked: every branch in here is hit in the tests
thanks |
git diff upstream/master -u -- "*.py" | flake8 --diff
Fixing this turned out to be a small beast. The overflows of interest fall into three categories. First: overflows when computing
periods * stride
that ATM are not caught, so give incorrect results:Second, overflows when
periods * stride
overflow the int64 bounds but not uint64 bounds, which in some cases are recoverable. The following should not raise:Last is a similar one with a negative stride where we can recover, but not by converting to uint64: