Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: multi-timezone handling for array_to_datetime #24006

Closed
wants to merge 12 commits into from

Conversation

jbrockmendel
Copy link
Member

cc @mroeschke

ATM array_to_datetime handles strings and datetime objects very differently, with conversion.datetime_to_datetime64 picking up (some of) the slack. This unifies the treatment of strings/datetimes within array_to_datetime, rendering conversion_to_datetime64 (and some ugly try/excepts in pd.to_datetime) unnecessary.

As of now there are still 5 tests failing locally; resolving them will involve some design decisions.

  1. This PR introduces cases where dateutil's UTC or dateutil tzoffsets are returned while the test expects the equivalent pytz object. We can either a) try to convert them within array_to_datetime to more consistently return pytz objects, b) change the tests to expect the dateutil versions, or c) change the tests to not care.

  2. In the status quo we do (and test) something weird:

vals = [pd.Timestamp('2011-01-01 10:00'), pd.Timestamp('2011-01-02 10:00', tz='US/Eastern')]
# i.e. one tz-naive, one tz-aware

>>> pd.to_datetime(vals)
DatetimeIndex(['2011-01-01 05:00:00-05:00', '2011-01-02 10:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

# for strings we do something more reasonable

>>> pd.to_datetime([str(x) for x in vals])
Index([2011-01-01 10:00:00, 2011-01-02 10:00:00-05:00], dtype='object')

This PR changes the first call to behave like the second.

2b) We also need to decide whether datetime64 are considered naive or UTC within array_to_datetime

@pep8speaks
Copy link

pep8speaks commented Nov 29, 2018

Hello @jbrockmendel! Thanks for updating the PR.

Line 554:80: E501 line too long (102 > 79 characters)

Line 286:28: E127 continuation line over-indented for visual indent

Line 969:80: E501 line too long (80 > 79 characters)

Comment last updated on December 09, 2018 at 00:46 Hours UTC

@mroeschke
Copy link
Member

Response to your questions:

  1. During my is_utc overall, I added a test for a case where a dateutil UTC was getting coerced to a pytz UTC. IMO I think it's a bug if we coerce dateutil objects to pytz objects or vice versa so those test should be fixed. 94ce05d#diff-ad26614f192ef4be1906047f739c3644R592

1-sidenote) The interesting case becomes what happens if you mix dateutil and pytz objects (same timezone or not). My gut reaction is to cast to object, but I could see how raising could also be appropriate.

  1. I agree that the second behavior is better than the first.

2b) Well the datetime64 will technically always be naive since its the numpy type? But I think more to your point, I think it should represent UTC inside array_to_datetime and should return UTC if timezone info was parsed.

return tz_cache_key(tz)


cdef fixed_offset_to_pytz(tz):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in my comment, I don't think we shouldn't be converting dateutil objects to pytz objects unless completely necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to agree, am seeing now how many tests break if I change this.

What if we make get_key return "UTC" for [pytz, tzutc(), timezone.utc], and similarly map all FixedOffsets/tzoffsets to appropriate integers? Then because in this PR we're using a dict instead of a set, if two tzinfos are "equivalent" (i.e. produce the same key) then we will just end up using whatever was the last tzinfo with that key. At least in this subset of cases, I think the intention is Sufficiently Clear.

The downside is that to_datetime(arr) and to_datetime(arr[::-1]) could then have different tzinfos objects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand some implications:

  • 3 timestamps with pytz.UTC, dateutil.tz.tzutc(), datetime.timezone.utc respectively would be coerced to one tz instance (the last one as you mention)
  • 2 timestamps with US/Pacific and dateutil/US/Pacific respectively would have their tzs coerced to one tz instance (the last one as you mention)

In general I am not a fan of this coercion. I would prefer to maintain the individual timezones instances (or raise) over coercion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth separately considering pd.to_datetime(...) vs pd.DatetimeIndex(...). The latter has a much stronger case for coercing equivalent tzinfos. For the former, mixing and matching is sufficiently weird that the user might be doing it intentionally.

Also note that the existing code coerces dateutil tzoffsets parsed from strings to pytz.FixedOffset. As a result, to_datetime(x) and to_datetime(Timestamp(x)) can have different tzinfos. Not quite the same situation, but similar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coercing equivalent tzinfos

If we end up doing this then, I would prefer to pick a consistent timezone object (probably pytz) and not pick a psudo-random one (e.g. pick the last one).

In the case of mixed timezones though, would we coerce to object or raise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to pick a consistent timezone object (probably pytz) and not pick a psudo-random one (e.g. pick the last one).

So we would need some kind of hierarchy like:

def choose_utc(utcs):
    if pytz.utc in utcs:
        return pytz.utc
    elif PY3 and datetime.timezone.utc in utcs:
        return datetime.timezone.utc
    elif ...

def choose_fixed_offset(fixed):
    ....

In the case of mixed timezones though, would we coerce to object or raise?

The approach I took in #23675 is to have objects_to_dt64ns have an allow_object kwarg that determines this.

Copy link
Member

@mroeschke mroeschke Dec 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another (far in the future) option is to use a pandas-equivalent tzinfo as discussed here: #23959 (comment). But yeah I hierarchy would be needed in the short term.

Since objects_to_dt64ns is just used internally, in what instances are we choosing for allow_object to be True vs False?

@@ -617,6 +647,8 @@ cpdef array_to_datetime(ndarray[object] values, str errors='raise',
# A ValueError at this point is a _parsing_ error
# specifically _not_ OutOfBoundsDatetime
if _parse_today_now(val, &iresult[i]):
# TODO: Do we treat this as local?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should follow datetime.now and datetime.today conventions and return local for both. IIRC there was discussion somewhere around this issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's #18705 suggesting to_datetime("now") should match Timestamp("now").

Consider two cases:

ts = pd.Timestamp('2018-11-29 18:02')
ts2 = ts.tz_localize('US/Pacific')

vals1 = [ts, 'now', 'today', ts.asm8]
vals2 = [ts2, 'now', 'today', ts.asm8]

dti1 = pd.to_datetime(vals1).tz_localize('US/Pacific')
dti2 = pd.to_datetime(vals2)  # raises in master

assert dti1[0] == dti2[0]

>>> dti1[1] - dti2[1]
Timedelta('0 days 07:59:59.999065')

>>> dti1[2] - dti2[2]
Timedelta('0 days 07:59:59.998852')

>>> dti1[3] - dti2[3]
Timedelta('0 days 08:00:00')

I'm not wild about having the parsed meaning for the last three entries depending on first entry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think that to_datetime("now") should match Timestamp("now"). I am not sure if I fully understand your example's point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I fully understand your example's point.

The point is that to_datetime([a, b, c])[1:] should not depend on a (except possibly for whether it is object-dtype)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke thoughts on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay yeah I agree with your example. One argument's timezone information shouldn't propagate to the other arguments. I would opt for vals2 to be cast to an Index with object dtype.

@gfyoung gfyoung added Datetime Datetime data dtype Timezones Timezone data dtype labels Nov 30, 2018
@codecov
Copy link

codecov bot commented Dec 1, 2018

Codecov Report

Merging #24006 into master will decrease coverage by 0.57%.
The diff coverage is 55.81%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24006      +/-   ##
==========================================
- Coverage   43.02%   42.44%   -0.58%     
==========================================
  Files         162      161       -1     
  Lines       51700    51563     -137     
==========================================
- Hits        22245    21887     -358     
- Misses      29455    29676     +221
Flag Coverage Δ
#single 42.44% <55.81%> (-0.58%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 50.74% <0%> (+1.41%) ⬆️
pandas/core/dtypes/cast.py 48.41% <100%> (-0.18%) ⬇️
pandas/core/tools/datetimes.py 34.68% <41.66%> (+0.63%) ⬆️
pandas/util/testing.py 51.88% <66.66%> (+0.26%) ⬆️
pandas/io/json/json.py 16.66% <0%> (-46.2%) ⬇️
pandas/io/common.py 38.75% <0%> (-4.66%) ⬇️
pandas/core/arrays/sparse.py 40.42% <0%> (-4.43%) ⬇️
pandas/core/dtypes/common.py 69.9% <0%> (-2.86%) ⬇️
pandas/io/formats/format.py 50.35% <0%> (-2.83%) ⬇️
pandas/core/sparse/series.py 43.75% <0%> (-2.68%) ⬇️
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update defa8a8...ec35002. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 1, 2018

Codecov Report

Merging #24006 into master will increase coverage by <.01%.
The diff coverage is 55.81%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24006      +/-   ##
==========================================
+ Coverage   42.46%   42.46%   +<.01%     
==========================================
  Files         161      161              
  Lines       51557    51558       +1     
==========================================
+ Hits        21892    21893       +1     
  Misses      29665    29665
Flag Coverage Δ
#single 42.46% <55.81%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 50.74% <0%> (ø) ⬆️
pandas/core/dtypes/cast.py 48.41% <100%> (-0.02%) ⬇️
pandas/core/tools/datetimes.py 34.68% <41.66%> (-0.39%) ⬇️
pandas/util/testing.py 51.88% <66.66%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b0610b...a2a01c9. Read the comment docs.

@@ -459,6 +461,26 @@ def array_with_unit_to_datetime(ndarray values, object unit,
return oresult


cdef get_key(tz):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type & add a doc-string


v, inferred_tz = tslib.array_to_datetime(v,
require_iso8601=True,
errors='raise')
except Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still needed?

if isinstance(left, DatetimeIndex):
# by now we know right is also a DatetimeIndex
assert_numpy_array_equal(left.asi8, right.asi8)
assert tz_compare(left.tz, right.tz)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is any of these changes needed? this is really suspect. instead the tz comparison should be done in assert_index_equal (which IIRC) is already done, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very kludgy and all needs to be pushed up the stack. I tentatively think this particular check will be fixed by making Datetime64TZDtype.__eq__ use tz_compare, at which point the changes here won't be necessary.

if is_fixed_offset(start) and is_fixed_offset(end):
start_seconds = get_fixed_offset_total_seconds(start)
end_seconds = get_fixed_offset_total_seconds(end)
return start_seconds == end_seconds
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke @jreback are we in agreement that two FixedOffsets of matching length should be considered equal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems reasonsable

can u just compare the start == end ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No:

>>> off1 = pytz.FixedOffset(420)
>>> off2 = dateutil.tz.tzoffset(None, 420*60)
>>> off1 == off2
False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that makes sense

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable.

If a user passes both a pytz.FixedOffset and a dateutil.tz.tzoffset will be coercing to one of the tzinfos? Once subtle point is if the dateutil.tz.tzoffset has a name but has the same offset as the pytz.FixedOffset, we should opt to keeping the dateutil instance so we don't drop the name.

@WillAyd
Copy link
Member

WillAyd commented Feb 27, 2019

@jbrockmendel still relevant?

@jbrockmendel
Copy link
Member Author

still relevant?

Yes. I'm in a less-active phase at the moment, but this is still a problem that needs to be solved. I'll return to it before too long.

@jreback
Copy link
Contributor

jreback commented Apr 20, 2019

closing as stale

@WillAyd
Copy link
Member

WillAyd commented Apr 22, 2019

Closing per above comment. Can always be reopened

@WillAyd WillAyd closed this Apr 22, 2019
@jbrockmendel jbrockmendel deleted the a2d branch April 5, 2020 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants