WIP: multi-timezone handling for array_to_datetime #24006

jbrockmendel · 2018-11-29T23:52:59Z

ATM array_to_datetime handles strings and datetime objects very differently, with conversion.datetime_to_datetime64 picking up (some of) the slack. This unifies the treatment of strings/datetimes within array_to_datetime, rendering conversion.datetime_to_datetime64 (and some ugly try/excepts in pd.to_datetime) unnecessary.

As of now there are still 5 tests failing locally; resolving them will involve some design decisions.

1) This PR introduces cases where dateutil's UTC or dateutil tzoffsets are returned while the test expects the equivalent pytz object. We can either a) try to convert them within array_to_datetime to more consistently return pytz objects, b) change the tests to expect the dateutil versions, or c) change the tests to not care.
2) In the status quo we do (and test) something weird:

vals = [pd.Timestamp('2011-01-01 10:00'), pd.Timestamp('2011-01-02 10:00', tz='US/Eastern')]
# i.e. one tz-naive, one tz-aware

>>> pd.to_datetime(vals)
DatetimeIndex(['2011-01-01 05:00:00-05:00', '2011-01-02 10:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

# for strings we do something more reasonable

>>> pd.to_datetime([str(x) for x in vals])
Index([2011-01-01 10:00:00, 2011-01-02 10:00:00-05:00], dtype='object')

This PR changes the first call to behave like the second.

2b) We also need to decide whether datetime64 are considered naive or UTC within array_to_datetime
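To make the "weird" status-quo output in 2) concrete, here is a minimal stdlib-only sketch. It assumes a fixed -05:00 offset as a stand-in for US/Eastern in January, so it is an illustration of the two interpretations rather than what array_to_datetime actually executes. Treating the naive 10:00 value as UTC and then converting gives the 05:00-05:00 seen in the first output, while localizing the same wall time keeps 10:00.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed -05:00 offset stands in for US/Eastern in January,
# so the sketch does not need a timezone database.
eastern_standard = timezone(timedelta(hours=-5))

naive = datetime(2011, 1, 1, 10, 0)

# Interpretation A: the naive wall time is UTC, then converted to -05:00.
as_utc_then_converted = naive.replace(tzinfo=timezone.utc).astimezone(eastern_standard)
print(as_utc_then_converted.isoformat())  # 2011-01-01T05:00:00-05:00

# Interpretation B: the naive wall time is already local to -05:00.
localized = naive.replace(tzinfo=eastern_standard)
print(localized.isoformat())  # 2011-01-01T10:00:00-05:00
```

The same choice (naive vs. UTC) is what 2b) asks about for datetime64 inputs.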

pep8speaks · 2018-11-29T23:53:04Z

Hello @jbrockmendel! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/dtypes/cast.py !
There are no PEP8 issues in the file pandas/core/series.py !
In the file pandas/core/tools/datetimes.py, following are the PEP8 issues :

Line 554:80: E501 line too long (102 > 79 characters)

In the file pandas/tests/indexes/datetimes/test_construction.py, following are the PEP8 issues :

Line 286:28: E127 continuation line over-indented for visual indent

There are no PEP8 issues in the file pandas/tests/indexes/datetimes/test_tools.py !
There are no PEP8 issues in the file pandas/tests/tslibs/test_array_to_datetime.py !
In the file pandas/util/testing.py, following are the PEP8 issues :

Line 969:80: E501 line too long (80 > 79 characters)

Comment last updated on December 09, 2018 at 00:46 Hours UTC

mroeschke · 2018-11-30T00:37:26Z

Response to your questions:

1) During my is_utc overhaul, I added a test for a case where a dateutil UTC was getting coerced to a pytz UTC. IMO it's a bug if we coerce dateutil objects to pytz objects or vice versa, so those tests should be fixed. 94ce05d#diff-ad26614f192ef4be1906047f739c3644R592

1-sidenote) The interesting case becomes what happens if you mix dateutil and pytz objects (same timezone or not). My gut reaction is to cast to object, but I could see how raising could also be appropriate.

2) I agree that the second behavior is better than the first.

2b) Well the datetime64 will technically always be naive since it's the numpy type? But more to your point, I think it should represent UTC inside array_to_datetime and should return UTC if timezone info was parsed.

mroeschke · 2018-11-30T01:05:08Z

pandas/_libs/tslib.pyx

+    return tz_cache_key(tz)
+
+
+cdef fixed_offset_to_pytz(tz):


As mentioned in my comment, I don't think we should be converting dateutil objects to pytz objects unless completely necessary.

I'm inclined to agree, am seeing now how many tests break if I change this.

What if we make get_key return "UTC" for [pytz, tzutc(), timezone.utc], and similarly map all FixedOffsets/tzoffsets to appropriate integers? Then because in this PR we're using a dict instead of a set, if two tzinfos are "equivalent" (i.e. produce the same key) then we will just end up using whatever was the last tzinfo with that key. At least in this subset of cases, I think the intention is Sufficiently Clear.

The downside is that to_datetime(arr) and to_datetime(arr[::-1]) could then have different tzinfo objects.
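A rough sketch of the keyed-dict idea and its order dependence, using only the stdlib. The `get_key` body, `collect_tzinfos`, and the `OtherFixedOffset` class are hypothetical stand-ins written for this illustration (the class plays the role of a third-party fixed-offset tzinfo); none of this is the code in the PR.

```python
from datetime import timedelta, timezone, tzinfo


class OtherFixedOffset(tzinfo):
    """Hypothetical stand-in for a third-party fixed-offset tzinfo."""

    def __init__(self, minutes):
        self._offset = timedelta(minutes=minutes)

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return None


def get_key(tz):
    """Hypothetical key: 'UTC' for a zero offset, else the offset in seconds."""
    seconds = int(tz.utcoffset(None).total_seconds())
    return "UTC" if seconds == 0 else seconds


def collect_tzinfos(tzs):
    """Keep one tzinfo per key; a later tzinfo overwrites an earlier one."""
    found = {}
    for tz in tzs:
        found[get_key(tz)] = tz
    return found


stdlib_tz = timezone(timedelta(minutes=420))
other_tz = OtherFixedOffset(420)

forward = collect_tzinfos([stdlib_tz, other_tz])
backward = collect_tzinfos([other_tz, stdlib_tz])

# Same key in both cases, but a different object survives.
print(forward[25200] is other_tz)    # True
print(backward[25200] is stdlib_tz)  # True
```

Reversing the input changes which object is kept, which is the arr vs. arr[::-1] downside.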

Just so I understand some implications:

3 timestamps with pytz.UTC, dateutil.tz.tzutc(), datetime.timezone.utc respectively would be coerced to one tz instance (the last one as you mention)

2 timestamps with US/Pacific and dateutil/US/Pacific respectively would have their tzs coerced to one tz instance (the last one as you mention)

In general I am not a fan of this coercion. I would prefer to maintain the individual timezone instances (or raise) over coercion.

It may be worth separately considering pd.to_datetime(...) vs pd.DatetimeIndex(...). The latter has a much stronger case for coercing equivalent tzinfos. For the former, mixing and matching is sufficiently weird that the user might be doing it intentionally.

Also note that the existing code coerces dateutil tzoffsets parsed from strings to pytz.FixedOffset. As a result, to_datetime(x) and to_datetime(Timestamp(x)) can have different tzinfos. Not quite the same situation, but similar.

coercing equivalent tzinfos

If we end up doing this, then I would prefer to pick a consistent timezone object (probably pytz) and not pick a pseudo-random one (e.g. pick the last one).

In the case of mixed timezones though, would we coerce to object or raise?

I would prefer to pick a consistent timezone object (probably pytz) and not pick a pseudo-random one (e.g. pick the last one).

So we would need some kind of hierarchy like:

def choose_utc(utcs):
    if pytz.utc in utcs:
        return pytz.utc
    elif PY3 and datetime.timezone.utc in utcs:
        return datetime.timezone.utc
    elif ...

def choose_fixed_offset(fixed):
    ....
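For comparison with "last one wins", a self-contained sketch of what such a hierarchy might look like. The `VendorUTC` class and the `UTC_PREFERENCE` ordering are assumptions made up for this example (the class stands in for a pytz or dateutil UTC object); the actual ordering is exactly the design decision under discussion.

```python
from datetime import timedelta, timezone, tzinfo


class VendorUTC(tzinfo):
    """Hypothetical stand-in for a third-party UTC tzinfo."""

    def utcoffset(self, dt):
        return timedelta(0)

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return "UTC"


# Hypothetical preference order: earlier classes win over later ones.
UTC_PREFERENCE = (VendorUTC, timezone)


def choose_utc(utcs):
    """Pick the most-preferred object from inputs already known to be UTC."""
    for cls in UTC_PREFERENCE:
        for tz in utcs:
            if isinstance(tz, cls):
                return tz
    raise ValueError("no recognized UTC tzinfo found")


vendor_utc = VendorUTC()

# The result no longer depends on the order of the inputs.
print(choose_utc([timezone.utc, vendor_utc]) is vendor_utc)  # True
print(choose_utc([vendor_utc, timezone.utc]) is vendor_utc)  # True
```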

In the case of mixed timezones though, would we coerce to object or raise?

The approach I took in #23675 is to have objects_to_dt64ns have an allow_object kwarg that determines this.

Another (far in the future) option is to use a pandas-equivalent tzinfo as discussed here: #23959 (comment). But yeah, a hierarchy would be needed in the short term.

Since objects_to_dt64ns is just used internally, in what instances are we choosing for allow_object to be True vs False?

mroeschke · 2018-11-30T01:10:44Z

pandas/_libs/tslib.pyx

@@ -617,6 +647,8 @@ cpdef array_to_datetime(ndarray[object] values, str errors='raise',
                    # A ValueError at this point is a _parsing_ error
                    # specifically _not_ OutOfBoundsDatetime
                    if _parse_today_now(val, &iresult[i]):
+                        # TODO: Do we treat this as local?


I feel like we should follow datetime.now and datetime.today conventions and return local for both. IIRC there was discussion somewhere around this issue?
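For reference, the stdlib conventions being referred to, as a small sketch: both datetime.now() and datetime.today() return naive local wall-clock time, and an aware UTC value has to be requested explicitly.

```python
from datetime import datetime, timezone

local_now = datetime.now()            # naive, local wall-clock time
local_today = datetime.today()        # also naive, local wall-clock time
utc_now = datetime.now(timezone.utc)  # tz-aware, UTC

print(local_now.tzinfo)    # None
print(local_today.tzinfo)  # None
print(utc_now.tzinfo)      # UTC
```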

There's #18705 suggesting to_datetime("now") should match Timestamp("now").

Consider two cases:

ts = pd.Timestamp('2018-11-29 18:02')
ts2 = ts.tz_localize('US/Pacific')

vals1 = [ts, 'now', 'today', ts.asm8]
vals2 = [ts2, 'now', 'today', ts.asm8]

dti1 = pd.to_datetime(vals1).tz_localize('US/Pacific')
dti2 = pd.to_datetime(vals2)  # raises in master

assert dti1[0] == dti2[0]

>>> dti1[1] - dti2[1]
Timedelta('0 days 07:59:59.999065')
>>> dti1[2] - dti2[2]
Timedelta('0 days 07:59:59.998852')
>>> dti1[3] - dti2[3]
Timedelta('0 days 08:00:00')

I'm not wild about having the parsed meaning for the last three entries depending on first entry.

I do think that to_datetime("now") should match Timestamp("now"). I am not sure if I fully understand your example's point.

I am not sure if I fully understand your example's point.

The point is that to_datetime([a, b, c])[1:] should not depend on a (except possibly for whether it is object-dtype)

@mroeschke thoughts on this?

Okay yeah I agree with your example. One argument's timezone information shouldn't propagate to the other arguments. I would opt for vals2 to be cast to an Index with object dtype.

codecov · 2018-12-01T16:57:16Z

Codecov Report

Merging #24006 into master will decrease coverage by 0.57%.
The diff coverage is 55.81%.

@@            Coverage Diff             @@
##           master   #24006      +/-   ##
==========================================
- Coverage   43.02%   42.44%   -0.58%     
==========================================
  Files         162      161       -1     
  Lines       51700    51563     -137     
==========================================
- Hits        22245    21887     -358     
- Misses      29455    29676     +221

Flag	Coverage Δ
#single	`42.44% <55.81%> (-0.58%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/series.py	`50.74% <0%> (+1.41%)`	⬆️
pandas/core/dtypes/cast.py	`48.41% <100%> (-0.18%)`	⬇️
pandas/core/tools/datetimes.py	`34.68% <41.66%> (+0.63%)`	⬆️
pandas/util/testing.py	`51.88% <66.66%> (+0.26%)`	⬆️
pandas/io/json/json.py	`16.66% <0%> (-46.2%)`	⬇️
pandas/io/common.py	`38.75% <0%> (-4.66%)`	⬇️
pandas/core/arrays/sparse.py	`40.42% <0%> (-4.43%)`	⬇️
pandas/core/dtypes/common.py	`69.9% <0%> (-2.86%)`	⬇️
pandas/io/formats/format.py	`50.35% <0%> (-2.83%)`	⬇️
pandas/core/sparse/series.py	`43.75% <0%> (-2.68%)`	⬇️
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update defa8a8...ec35002. Read the comment docs.

codecov · 2018-12-01T16:57:16Z

Codecov Report

Merging #24006 into master will increase coverage by <.01%.
The diff coverage is 55.81%.

@@            Coverage Diff             @@
##           master   #24006      +/-   ##
==========================================
+ Coverage   42.46%   42.46%   +<.01%     
==========================================
  Files         161      161              
  Lines       51557    51558       +1     
==========================================
+ Hits        21892    21893       +1     
  Misses      29665    29665

Flag	Coverage Δ
#single	`42.46% <55.81%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/series.py	`50.74% <0%> (ø)`	⬆️
pandas/core/dtypes/cast.py	`48.41% <100%> (-0.02%)`	⬇️
pandas/core/tools/datetimes.py	`34.68% <41.66%> (-0.39%)`	⬇️
pandas/util/testing.py	`51.88% <66.66%> (+0.07%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b0610b...a2a01c9. Read the comment docs.

jreback · 2018-12-02T23:42:41Z

pandas/_libs/tslib.pyx

@@ -459,6 +461,26 @@ def array_with_unit_to_datetime(ndarray values, object unit,
    return oresult


+cdef get_key(tz):


can you type & add a doc-string

jreback · 2018-12-02T23:45:06Z

pandas/core/dtypes/cast.py

-
+            v, inferred_tz = tslib.array_to_datetime(v,
+                                                     require_iso8601=True,
+                                                     errors='raise')
        except Exception:


this still needed?

jreback · 2018-12-02T23:49:05Z

pandas/util/testing.py

+    if isinstance(left, DatetimeIndex):
+        # by now we know right is also a DatetimeIndex
+        assert_numpy_array_equal(left.asi8, right.asi8)
+        assert tz_compare(left.tz, right.tz)


Why are any of these changes needed? This is really suspect. Instead, the tz comparison should be done in assert_index_equal (which IIRC is already done), no?

This is very kludgy and all needs to be pushed up the stack. I tentatively think this particular check will be fixed by making Datetime64TZDtype.__eq__ use tz_compare, at which point the changes here won't be necessary.

jbrockmendel · 2018-12-03T19:38:32Z

pandas/_libs/tslibs/timezones.pyx

+    if is_fixed_offset(start) and is_fixed_offset(end):
+        start_seconds = get_fixed_offset_total_seconds(start)
+        end_seconds = get_fixed_offset_total_seconds(end)
+        return start_seconds == end_seconds


@mroeschke @jreback are we in agreement that two FixedOffsets of matching length should be considered equal?

this seems reasonable

can u just compare the start == end ?

No:

>>> off1 = pytz.FixedOffset(420)
>>> off2 = dateutil.tz.tzoffset(None, 420*60)
>>> off1 == off2
False

ok that makes sense

Sounds reasonable.

If a user passes both a pytz.FixedOffset and a dateutil.tz.tzoffset, will we be coercing to one of the tzinfos? One subtle point is that if the dateutil.tz.tzoffset has a name but has the same offset as the pytz.FixedOffset, we should opt to keep the dateutil instance so we don't drop the name.
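The name-dropping concern can be seen even with the stdlib alone. In this sketch, two datetime.timezone objects with the same offset compare equal although only one carries a custom name, so replacing one with the other silently changes tzname. The "ICT" label is just an illustrative name chosen for the example.

```python
from datetime import timedelta, timezone

named = timezone(timedelta(hours=7), "ICT")  # illustrative custom name
unnamed = timezone(timedelta(hours=7))

# Equality only looks at the offset ...
print(named == unnamed)      # True

# ... but the names differ, so coercing to `unnamed` would lose "ICT".
print(named.tzname(None))    # ICT
print(unnamed.tzname(None))  # UTC+07:00
```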

WillAyd · 2019-02-27T23:51:48Z

@jbrockmendel still relevant?

jbrockmendel · 2019-02-27T23:53:06Z

still relevant?

Yes. I'm in a less-active phase at the moment, but this is still a problem that needs to be solved. I'll return to it before too long.

jreback · 2019-04-20T17:35:35Z

closing as stale

WillAyd · 2019-04-22T15:42:17Z

Closing per above comment. Can always be reopened

jbrockmendel added 3 commits November 29, 2018 14:26

WIP: fix array_to_datetime

13336d9

Merge branch 'master' of https://github.com/pandas-dev/pandas into a2d

4b80797

remove no-longer-needed

f0dccc7

mroeschke reviewed Nov 30, 2018

fix tests, make timezone funcs stricter

1b36e6f

gfyoung added Datetime Datetime data dtype Timezones Timezone data dtype labels Nov 30, 2018

jbrockmendel added 4 commits November 30, 2018 16:27

kludge the kludge

9d42d97

Merge branch 'master' of https://github.com/pandas-dev/pandas into a2d

515a23b

typo fixup

edc177d

Merge branch 'master' of https://github.com/pandas-dev/pandas into a2d

a2a01c9

This was referenced Dec 2, 2018

REF: Move non-raising parts of array_to_datetime outside of try/except #24032

Merged

REF: array_to_datetime catch overflows in one place #24049

Merged

jbrockmendel added 2 commits December 2, 2018 13:59

manual rebase

1c6a8ee

Merge branch 'master' of https://github.com/pandas-dev/pandas into a2d

9677010

jbrockmendel mentioned this pull request Dec 2, 2018

REF/DEPR: DatetimeIndex constructor #23675

Merged

Merge branch 'master' of https://github.com/pandas-dev/pandas into a2d

cf8b4cc

jreback requested changes Dec 2, 2018

jbrockmendel commented Dec 3, 2018

Merge branch 'master' of https://github.com/pandas-dev/pandas into a2d

ec35002

This was referenced Jan 2, 2019

DatetimeArray._from_sequence with mixture of tz-naive and tz-aware data. #24569

Closed

BUG: fix to_datetime failing to raise on mixed tznaive/tzaware datetimes #24663

Merged

jbrockmendel mentioned this pull request Jan 29, 2019

to_datetime uses previous row's timezone when timezone not specified and utc=True #24992

Closed

WillAyd closed this Apr 22, 2019

jbrockmendel deleted the a2d branch April 5, 2020 17:42
