-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: multi-timezone handling for array_to_datetime #24006
Conversation
Hello @jbrockmendel! Thanks for updating the PR.
Comment last updated on December 09, 2018 at 00:46 Hours UTC |
Response to your questions:
1-sidenote) The interesting case becomes what happens if you mix dateutil and pytz objects (same timezone or not). My gut reaction is to cast to object, but I could see how raising could also be appropriate.
2b) Well the datetime64 will technically always be naive since its the numpy type? But I think more to your point, I think it should represent UTC inside |
pandas/_libs/tslib.pyx
Outdated
return tz_cache_key(tz) | ||
|
||
|
||
cdef fixed_offset_to_pytz(tz): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in my comment, I don't think we shouldn't be converting dateutil objects to pytz objects unless completely necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inclined to agree, am seeing now how many tests break if I change this.
What if we make get_key
return "UTC" for [pytz, tzutc(), timezone.utc], and similarly map all FixedOffsets/tzoffsets to appropriate integers? Then because in this PR we're using a dict instead of a set, if two tzinfos are "equivalent" (i.e. produce the same key) then we will just end up using whatever was the last tzinfo with that key. At least in this subset of cases, I think the intention is Sufficiently Clear.
The downside is that to_datetime(arr)
and to_datetime(arr[::-1])
could then have different tzinfos objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand some implications:
- 3 timestamps with
pytz.UTC
,dateutil.tz.tzutc()
,datetime.timezone.utc
respectively would be coerced to one tz instance (the last one as you mention) - 2 timestamps with
US/Pacific
anddateutil/US/Pacific
respectively would have their tzs coerced to one tz instance (the last one as you mention)
In general I am not a fan of this coercion. I would prefer to maintain the individual timezones instances (or raise) over coercion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth separately considering pd.to_datetime(...)
vs pd.DatetimeIndex(...)
. The latter has a much stronger case for coercing equivalent tzinfos. For the former, mixing and matching is sufficiently weird that the user might be doing it intentionally.
Also note that the existing code coerces dateutil tzoffsets parsed from strings to pytz.FixedOffset. As a result, to_datetime(x)
and to_datetime(Timestamp(x))
can have different tzinfos. Not quite the same situation, but similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coercing equivalent tzinfos
If we end up doing this then, I would prefer to pick a consistent timezone object (probably pytz) and not pick a psudo-random one (e.g. pick the last one).
In the case of mixed timezones though, would we coerce to object or raise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to pick a consistent timezone object (probably pytz) and not pick a psudo-random one (e.g. pick the last one).
So we would need some kind of hierarchy like:
def choose_utc(utcs):
if pytz.utc in utcs:
return pytz.utc
elif PY3 and datetime.timezone.utc in utcs:
return datetime.timezone.utc
elif ...
def choose_fixed_offset(fixed):
....
In the case of mixed timezones though, would we coerce to object or raise?
The approach I took in #23675 is to have objects_to_dt64ns
have an allow_object
kwarg that determines this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another (far in the future) option is to use a pandas-equivalent tzinfo as discussed here: #23959 (comment). But yeah I hierarchy would be needed in the short term.
Since objects_to_dt64ns
is just used internally, in what instances are we choosing for allow_object
to be True vs False?
pandas/_libs/tslib.pyx
Outdated
@@ -617,6 +647,8 @@ cpdef array_to_datetime(ndarray[object] values, str errors='raise', | |||
# A ValueError at this point is a _parsing_ error | |||
# specifically _not_ OutOfBoundsDatetime | |||
if _parse_today_now(val, &iresult[i]): | |||
# TODO: Do we treat this as local? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should follow datetime.now
and datetime.today
conventions and return local for both. IIRC there was discussion somewhere around this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's #18705 suggesting to_datetime("now")
should match Timestamp("now")
.
Consider two cases:
ts = pd.Timestamp('2018-11-29 18:02')
ts2 = ts.tz_localize('US/Pacific')
vals1 = [ts, 'now', 'today', ts.asm8]
vals2 = [ts2, 'now', 'today', ts.asm8]
dti1 = pd.to_datetime(vals1).tz_localize('US/Pacific')
dti2 = pd.to_datetime(vals2) # raises in master
assert dti1[0] == dti2[0]
>>> dti1[1] - dti2[1]
Timedelta('0 days 07:59:59.999065')
>>> dti1[2] - dti2[2]
Timedelta('0 days 07:59:59.998852')
>>> dti1[3] - dti2[3]
Timedelta('0 days 08:00:00')
I'm not wild about having the parsed meaning for the last three entries depending on first entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think that to_datetime("now")
should match Timestamp("now")
. I am not sure if I fully understand your example's point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if I fully understand your example's point.
The point is that to_datetime([a, b, c])[1:]
should not depend on a
(except possibly for whether it is object-dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay yeah I agree with your example. One argument's timezone information shouldn't propagate to the other arguments. I would opt for vals2
to be cast to an Index
with object dtype.
Codecov Report
@@ Coverage Diff @@
## master #24006 +/- ##
==========================================
- Coverage 43.02% 42.44% -0.58%
==========================================
Files 162 161 -1
Lines 51700 51563 -137
==========================================
- Hits 22245 21887 -358
- Misses 29455 29676 +221
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24006 +/- ##
==========================================
+ Coverage 42.46% 42.46% +<.01%
==========================================
Files 161 161
Lines 51557 51558 +1
==========================================
+ Hits 21892 21893 +1
Misses 29665 29665
Continue to review full report at Codecov.
|
@@ -459,6 +461,26 @@ def array_with_unit_to_datetime(ndarray values, object unit, | |||
return oresult | |||
|
|||
|
|||
cdef get_key(tz): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type & add a doc-string
|
||
v, inferred_tz = tslib.array_to_datetime(v, | ||
require_iso8601=True, | ||
errors='raise') | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still needed?
if isinstance(left, DatetimeIndex): | ||
# by now we know right is also a DatetimeIndex | ||
assert_numpy_array_equal(left.asi8, right.asi8) | ||
assert tz_compare(left.tz, right.tz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is any of these changes needed? this is really suspect. instead the tz comparison should be done in assert_index_equal (which IIRC) is already done, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very kludgy and all needs to be pushed up the stack. I tentatively think this particular check will be fixed by making Datetime64TZDtype.__eq__
use tz_compare, at which point the changes here won't be necessary.
if is_fixed_offset(start) and is_fixed_offset(end): | ||
start_seconds = get_fixed_offset_total_seconds(start) | ||
end_seconds = get_fixed_offset_total_seconds(end) | ||
return start_seconds == end_seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke @jreback are we in agreement that two FixedOffsets of matching length should be considered equal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems reasonsable
can u just compare the start == end ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No:
>>> off1 = pytz.FixedOffset(420)
>>> off2 = dateutil.tz.tzoffset(None, 420*60)
>>> off1 == off2
False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable.
If a user passes both a pytz.FixedOffset
and a dateutil.tz.tzoffset
will be coercing to one of the tzinfos? Once subtle point is if the dateutil.tz.tzoffset
has a name
but has the same offset as the pytz.FixedOffset
, we should opt to keeping the dateutil instance so we don't drop the name
.
@jbrockmendel still relevant? |
Yes. I'm in a less-active phase at the moment, but this is still a problem that needs to be solved. I'll return to it before too long. |
closing as stale |
Closing per above comment. Can always be reopened |
cc @mroeschke
ATM
array_to_datetime
handles strings and datetime objects very differently, withconversion.datetime_to_datetime64
picking up (some of) the slack. This unifies the treatment of strings/datetimes within array_to_datetime, renderingconversion_to_datetime64
(and some ugly try/excepts inpd.to_datetime
) unnecessary.As of now there are still 5 tests failing locally; resolving them will involve some design decisions.
This PR introduces cases where dateutil's UTC or dateutil tzoffsets are returned while the test expects the equivalent pytz object. We can either a) try to convert them within
array_to_datetime
to more consistently return pytz objects, b) change the tests to expect the dateutil versions, or c) change the tests to not care.In the status quo we do (and test) something weird:
This PR changes the first call to behave like the second.
2b) We also need to decide whether datetime64 are considered naive or UTC within
array_to_datetime