Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch exception around much smaller piece of code #23702

Closed
wants to merge 3 commits into from

Conversation

jbrockmendel
Copy link
Member

Broken off of #23675, followed by refactor.

The refactor has a non-trivial diff, but the only actual logic changed here is the except ValueError as e, instead of catching everything from L240-L303, now specifically only catches errors raised by tslib.array_to_datetime. Everything else is just moving code outside of the try/except block and de-indenting.

Catching more specific exceptions is worthwhile in its own right, but this is also an important step for implementing DatetimeArray._from_sequence (@TomAugspurger, @jorisvandenbossche, @jreback) since after this we can isolate the part of to_datetime called by DatetimeIndex.__new__ (xref #23675) to make the constructor non-circular (and then refactor it out into _from_sequence).

cc @mroeschke

@pep8speaks
Copy link

pep8speaks commented Nov 14, 2018

Hello @jbrockmendel! Thanks for updating the PR.

Line 250:12: E111 indentation is not a multiple of four

Comment last updated on November 15, 2018 at 02:51 Hours UTC

@gfyoung gfyoung added Refactor Internal refactoring of code Datetime Datetime data dtype labels Nov 14, 2018
@jbrockmendel
Copy link
Member Author

Azure fail looks unrelated

# reached when array_to_datetime raises "unless utc=True..."
try:
values, tz = conversion.datetime_to_datetime64(arg)
return DatetimeIndex._simple_new(values, name=name, tz=tz)
Copy link
Member

@mroeschke mroeschke Nov 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a future bug here in which the user passes box=False and we hit this part and return a DatetimeIndex instead of an array. (By inspection looks like the bug already would exist.)

I don't exactly know the criteria to set this up, but is it possible to delegate this to _maybe_box_date_results?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is #23045 related to the potential bug you have in mind?

but is it possible to delegate this to _maybe_box_date_results?

It's a little ungainly, but I think it can be done.

Copy link
Member

@mroeschke mroeschke Nov 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly, here's a quick example I found (albeit questionable if this is the correct behavior)

In [3]: pd.__version__
Out[3]: '0.24.0.dev0+1010.ge413c491e.dirty'

In [4]: import pytz; import datetime

In [5]: pd.to_datetime([datetime.datetime(2018, 1, 1), datetime.datetime(2018, 1, 1, tzinfo=pytz.UTC)], box=False)
Out[5]: DatetimeIndex(['2018-01-01 00:00:00+00:00', '2018-01-01 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

Doesn't necessarily have to be handled here; we can create a new issue for this but it hits this ValueError path.


if result is None and (format is None or infer_datetime_format):

if result is None and (format is None or infer_datetime_format):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you comment here why we might hit this branch? It looks like if no format was passed, or a format was passed and it failed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the branch that gets hit the most often, e.g. pd.to_datetime([pd.Timestamp.now(), '2015/04/05']) will go down this route.

Any list/object-dtyped case where a format is not passed.

Any list/object-dtype case where a format is passed, fails, and is allowed to try the fallback (via infer_datetime_format kwarg, which defaults to None)

@codecov
Copy link

codecov bot commented Nov 14, 2018

Codecov Report

Merging #23702 into master will increase coverage by <.01%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23702      +/-   ##
==========================================
+ Coverage   92.24%   92.25%   +<.01%     
==========================================
  Files         161      161              
  Lines       51339    51343       +4     
==========================================
+ Hits        47360    47364       +4     
  Misses       3979     3979
Flag Coverage Δ
#multiple 90.64% <87.5%> (ø) ⬆️
#single 42.34% <50%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/tools/datetimes.py 85.4% <87.5%> (+0.13%) ⬆️
pandas/core/frame.py 97.02% <0%> (ø) ⬆️
pandas/core/series.py 93.68% <0%> (ø) ⬆️
pandas/core/groupby/groupby.py 96.51% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4b945a...e725875. Read the comment docs.

@TomAugspurger
Copy link
Contributor

The changes seem fine, but I'm not familiar with this section of the code (and in particular I don't know how well tested the expected failure conditions are).

@jbrockmendel
Copy link
Member Author

Looks like azure failure in is in Hypothesis

@jbrockmendel
Copy link
Member Author

The changes seem fine, but I'm not familiar with this section of the code (and in particular I don't know how well tested the expected failure conditions are).

AFAIK the tests specific to this file are mostly in tests.indexes.datetimes.test_tools. I'll need to take a closer look at these tests to see if they can be made more specific, but so far they seem fairly thorough

@jorisvandenbossche
Copy link
Member

Is there any behavioural difference?

@jbrockmendel
Copy link
Member Author

Is there any behavioural difference?

Dangit, I just found a corner case in which there is:

>>> ts = pd.Timestamp.now()
>>> pd.to_datetime([ts], format='%Y%m%d')

returns DTI in master, raises ValueError in PR. Of course the user is a monster for passing this input, but still.

Another round of edits coming up in a bit.

@jorisvandenbossche
Copy link
Member

Hmm, that seems like something that does not necessarily need to work .. (if a user passes format, shouldn't the value be strings?)

Back to my original question: there is no intended behavioural change?

@jbrockmendel
Copy link
Member Author

there is no intended behavioural change?

Correct.

@jorisvandenbossche
Copy link
Member

But what was then the original reason to do this? There is some change I would assume? (didn't follow the discussion in the PR you are referencing) I now only catch more specific error. But so you encountered a case where another error was catched that should not have been catched?

@jbrockmendel
Copy link
Member Author

But what was then the original reason to do this?

Following this refactor, the call to array_to_datetime, along with the try/except that wraps it, can be refactored into a more tightly-scoped function doesnt call DatetimeIndex.__new__ or DatetimeIndex._simple_new. This function could then be called from DatetimeIndex.__new__ (per discussion in #23675) instead of to_datetime, resolving the existing circularity.

(Good call asking me to make that explicit; I should have explained that from the start)

@jbrockmendel
Copy link
Member Author

@mroeschke you recently implemented the timezone checking in tslib.array_to_datetime for strings there. Could the same machinery be extended to handle datetimes? Because AFAICT that's the only thing that conversion.datetime_to_datetime64 does. If we can get that logic in one place, we can avoid the fallback to datetime_to_datetime64 (which is itself only called in two places, so we may be able to remove it entirely)

@mroeschke
Copy link
Member

mroeschke commented Nov 15, 2018

@jbrockmendel Definitely. Not sure what it would look like since I had to weave it into the beast of tslib.array_to_datetime, but I don't see why not.


def _maybe_box_date_results(result, box, tz, name, tz_parsed=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a docstring here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@jbrockmendel
Copy link
Member Author

Good news everyone! I found a case that behaves differently on OSX vs Linux in master

now = Timestamp.now()
values = np.array([now, "2018-11-12"], dtype=np.object_)
result = to_datetime(values, format="%Y%m%d",
                     infer_datetime_format=True)

raises on Linux, returns correct result on OSX. Both cases in py27. (waiting for setup.py build_ext to run so i can check 36/37)

This is turning out hairier than expected. I think The Right Way to do this involves combining array_to_datetime with datetime_to_datetime64 as discussed above, and probably making the behavior in the require_iso8601 case more strict (but I'll open a separate issue and/or PR for that)

@jbrockmendel
Copy link
Member Author

In py3 this raises on both OSX and Linux

@jbrockmendel
Copy link
Member Author

I'm... not happy with how this is turning out. Closing.

@jbrockmendel jbrockmendel deleted the pre-fixes5 branch April 5, 2020 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants