-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/API: to_datetime preserves UTC offsets when parsing datetime strings #21822
Conversation
pandas/_libs/tslib.pyx
Outdated
if not is_same_offsets: | ||
raise TypeError | ||
else: | ||
# Open question: should this return dateutil offset or pytz offset? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default to dateutil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR then, is it okay that parsing through Timestamp
will produce a pytz.FixedOffset
and to_datetime
will producea dateutil.tz.tzoffset
?
I should probably start a larger discussion whether we should be migrating from pytz to dateutil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, forgot that Timestamp
defaulted to pytz.FixedOffset
. Sharing code with the Timestamp
constructor is definitely a higher priority than defaulting to dateutil.tz
.
pandas/core/tools/datetimes.py
Outdated
arg, | ||
errors=errors, | ||
utc=tz == 'utc', | ||
dayfirst=dayfirst, | ||
yearfirst=yearfirst, | ||
require_iso8601=require_iso8601 | ||
) | ||
if tz_parsed is not None: | ||
return DatetimeIndex._simple_new(result, name=name, | ||
tz=tz_parsed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case with multiple tzs that has to get wrapped in object-dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That case will result in tz_parsed = None
so this branch will not be hit.
Looking over the test failures, most of them are clearly the-test-is-wrong. |
Codecov Report
@@ Coverage Diff @@
## master #21822 +/- ##
==========================================
+ Coverage 92.07% 92.07% +<.01%
==========================================
Files 170 170
Lines 50690 50696 +6
==========================================
+ Hits 46672 46678 +6
Misses 4018 4018
Continue to review full report at Codecov.
|
2) datetime.datetime objects, if OutOfBoundsDatetime or TypeError | ||
is encountered | ||
|
||
Also returns a pytz.FixedOffset if an array of strings with the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle other tzinfos could be returned, specifically if it falls back to dateutil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specifying that array_to_datetime
function itself can return a pytz.FixedOffset
or None
from the C parser output. If it goes through the dateutil parser in the non-ValueError
branch, I don't think there's a way to recover the timezone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When parse_datetime_string
is called if there's a timezone it should return a tz-aware datetime
object, so the tzinfo can be pulled off that can't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right that's true, good catch. Sure I will try to add some tests and functionality to hit the dateutil parser before the object branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am using a set
to store the parsed timezone offsets (which should be more performant that using an array in theory / I was having some trouble using an array due to duplicates), however dateutil.tz.tzoffset
s cannot be hashed: dateutil/dateutil#792
So instead, I am saving the total_seconds()
of the dateutil tzoffset in the set instead and reconstructing the offsets as pytz.FixedOffset
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add Parameters here
pandas/_libs/tslib.pyx
Outdated
|
||
Returns | ||
------- | ||
(ndarray, timezone offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
underscore between timezone and offset?
pandas/_libs/tslib.pyx
Outdated
# the parsed dates to (maybe) pass to DatetimeIndex | ||
# 2) If the offsets are different, then force the parsing down the | ||
# object path where an array of datetimes | ||
# (with individual datutil.tzoffsets) are returned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo datutil
pandas/_libs/tslib.pyx
Outdated
# Faster to compare integers than to compare objects | ||
is_same_offsets = (out_tzoffset_vals[0] == out_tzoffset_vals).all() | ||
if not is_same_offsets: | ||
raise TypeError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (pre-existing) pattern is pretty opaque to a first-time reader. What if instead of raising TypeError
the fallback block became its own function that gets called from here?
assert is_datetime64_ns_dtype(i) | ||
|
||
# tz coerceion | ||
result = pd.to_datetime(i, errors='coerce', cache=cache) | ||
tm.assert_index_equal(result, i) | ||
|
||
result = pd.to_datetime(i, errors='coerce', utc=True, cache=cache) | ||
expected = pd.DatetimeIndex(['2000-01-01 13:00:00'], | ||
expected = pd.DatetimeIndex(['2000-01-01 08:00:00'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reason for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The +00:00
in i
above required a tz_convert
post construction while I think the original intention was to tz_localize
it to the passed psycopg.tz
in the constructor (leading to this change)
But to reflect the intension of the original test, I just removed the +00:00
instead
|
||
# TODO: Appears that parsing non-ISO strings adjust the date to UTC | ||
# but don't return the offset. Not sure if this is the intended | ||
# behavior of non-iso strings in np_datetime_strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np_datetime_strings doesn't handle non-ISO. That case ends up going through dateutil (via parse_datetime_string
). I'm not totally sure what the TODO is for. '01-01-2013 00:00:00'
comes back tz-naive, right?
pandas/_libs/tslib.pyx
Outdated
# (with individual datutil.tzoffsets) are returned | ||
|
||
# Faster to compare integers than to compare objects | ||
is_same_offsets = (out_tzoffset_vals[0] == out_tzoffset_vals).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a perf tradeoff here, specifically in the case where we have all-strings, all of which are ISO, but that don't have matching timezones. Going through the parse_datetime_string
path below is much slower than _string_to_dts
. Going through the python path entails a big hit.
The various paths (including require_iso8859
ugh) make this a giant hassle. @jreback one way to simplify this hassle would be to strengthen the requirement on require_iso8859
. ATM it raises if it sees a non-ISO string, but is fine with datetime/np.datetime64 objects. If it were strings-only, then a bunch of logic could be simplified (not necessarily this PR). Thoughts?
A few comments, mostly about some tough perf tradeoffs. Generally this looks great; I'm really psyched to see this get fixed. |
Addressed your comments @jreback & @jbrockmendel. Open for a final look. |
pandas/_libs/tslib.pyx
Outdated
yearfirst=yearfirst) | ||
pydatetime_to_dt64(oresult[i], &dts) | ||
check_dts_bounds(&dts) | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could probably be more specific. parse_datetime_string could raise ValueError or OverflowError, check_dts_bounds raises OutOfBoundsDatetime (which subclasses ValueError), and I think thats it.
Does this make |
This doesn't fix "now". ("today" looks like it was handled in #19937) |
can you rebase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments. @jbrockmendel any comments?
|
||
*Previous Behavior*: | ||
|
||
.. code-block:: ipython |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this directiive you just be python (rather than ipython here) @TomAugspurger ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at previous whatsnews, looks like either ipython or python works for this directive.
@@ -123,11 +147,11 @@ def test_coerce_of_invalid_datetimes(self): | |||
|
|||
# Without coercing, the presence of any invalid dates prevents | |||
# any values from being converted | |||
result = tslib.array_to_datetime(arr, errors='ignore') | |||
result = tslib.array_to_datetime(arr, errors='ignore')[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like writing these like
result, _ = ......
I think its more clear
@mroeschke can you also run some timeseries benchmarks generally to see if any effects (I would expected some small slowdown for parsing but but not significant). You may need to add some benchmarks to capture the new code paths. |
Just that I’m pretty stoked to see this fixed. |
From one new ASV I added, parsing strings with different offests will be slower, but I think it's acceptable given the new (more correct) behavior
|
ncie patch @mroeschke ! |
# is comparison fails unlike other dateutil.tz | ||
# objects. Also, dateutil.tz.tzlocal has no | ||
# _offset attribute like tzoffset | ||
offset_seconds = tz._dst_offset.total_seconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is using an internal method and is fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Suggestion for a better way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what this function is supposed to do, but from what I do understand it looks like a mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original intent with this section was to collect the dateutil timezone objects in a set
and determine if more than 1 distinct timezone object was collected in the end.
However since dateutil timezones cannot be hashed (thanks for starting progress on the associated issue btw @pganssle), I instead store the timezone's UTC offsets in seconds instead (which is what I essentially care about).
It wasn't too apparent to me if there was a public way to access a dateutil timezone's UTC offset, hence why I am use a private method here. Definitely open to a better method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you can't find a public method you probably shouldn't dive into the private API. There are a few assumptions in here that won't survive future versions of dateutil
for sure. The tzlocal
thing also actually gives you the wrong answer, because tzlocal
is not a fixed offset and you're taking the DST offset.
If you want the offset, you should do dt.utcoffset()
, regardless of whether it's dateutil
or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are significantly worried about speed, you can probably use the fact that in dateutil >= 2.7.0
, dateutil.tz.tzutc()
returns a singleton, and dateutil.tz.tzoffset(*args1)
is the same object as any other dateutil.tz.tzoffset(*args2)
where args1 == args2
. So you can probably store a mapping between id(obj)
and the result of obj.tzoffset(datetime(1970, 1, 1))
for that specific subclass (since it is also guaranteed to have no DST).
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR makes
to_datetime(ts_string_with_offset)
andDatetimeIndex([ts_string_with_offset])
to now matchTimestamp(ts_string_with_offset)