BUG/API: to_datetime preserves UTC offsets when parsing datetime strings #21822

mroeschke · 2018-07-08T22:28:14Z

closes Timestamp(foo) vs to_datetime(foo) #17697
closes Regression: DatetimeIndex ignores timezones #11736
closes API/BUG: to_datetime(..., box=True) should always return an Index #21864
precursor to fixing BUG: invalid construction from repr of dt-aware index #15938
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This PR makes to_datetime(ts_string_with_offset) and DatetimeIndex([ts_string_with_offset]) match Timestamp(ts_string_with_offset).

This branch

In [2]: Timestamp("2015-11-18 15:30:00+05:30")
Out[2]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')

In [3]: to_datetime("2015-11-18 15:30:00+05:30")
Out[3]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')

In [4]: DatetimeIndex(["2015-11-18 15:30:00+05:30"])
Out[4]: DatetimeIndex(['2015-11-18 15:30:00+05:30'], dtype='datetime64[ns, pytz.FixedOffset(330)]', freq=None)
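
For readers without this branch checked out, the offset in these outputs can be inspected with the standard library alone. This is only an analogy to the pandas behavior above, not pandas code: `datetime.fromisoformat` keeps the `+05:30` offset, and 330 is that offset expressed in minutes, which is what `pytz.FixedOffset(330)` denotes.

```python
from datetime import datetime, timedelta

# Standard-library analogy (not pandas): the parsed value keeps its offset.
parsed = datetime.fromisoformat("2015-11-18 15:30:00+05:30")

# Express the UTC offset in whole minutes, the unit pytz.FixedOffset uses.
offset_minutes = parsed.utcoffset() // timedelta(minutes=1)

print(parsed.isoformat(), offset_minutes)
```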

jbrockmendel · 2018-07-09T03:26:19Z

pandas/_libs/tslib.pyx

+            if not is_same_offsets:
+                raise TypeError
+            else:
+                # Open question: should this return dateutil offset or pytz offset?


default to dateutil

For this PR then, is it okay that parsing through Timestamp will produce a pytz.FixedOffset and to_datetime will produce a dateutil.tz.tzoffset?

I should probably start a larger discussion whether we should be migrating from pytz to dateutil

My mistake, forgot that Timestamp defaulted to pytz.FixedOffset. Sharing code with the Timestamp constructor is definitely a higher priority than defaulting to dateutil.tz.

jbrockmendel · 2018-07-09T03:27:01Z

pandas/core/tools/datetimes.py

                arg,
                errors=errors,
                utc=tz == 'utc',
                dayfirst=dayfirst,
                yearfirst=yearfirst,
                require_iso8601=require_iso8601
            )
+            if tz_parsed is not None:
+                return DatetimeIndex._simple_new(result, name=name,
+                                                 tz=tz_parsed)


What about the case with multiple tzs that has to get wrapped in object-dtype?

That case will result in tz_parsed = None so this branch will not be hit.
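
The contract described in this reply can be sketched with the standard library. This is an illustration, not the code in tslib.pyx: `parse_with_shared_offset` is a hypothetical helper, and `datetime.fromisoformat` stands in for the C parser. It returns a single offset only when every string carries the same one, and `None` otherwise, so a caller would skip the tz-aware branch for mixed offsets.

```python
from datetime import datetime, timezone


def parse_with_shared_offset(strings):
    """Hypothetical helper: return (parsed, tz_parsed).

    tz_parsed is the common UTC offset when all inputs share one,
    otherwise None (the caller would then fall back to object dtype).
    """
    parsed = [datetime.fromisoformat(s) for s in strings]
    offsets = {dt.utcoffset() for dt in parsed}
    if len(offsets) == 1 and None not in offsets:
        return parsed, timezone(next(iter(offsets)))
    return parsed, None


same = ["2015-11-18 15:30:00+05:30", "2015-11-19 15:30:00+05:30"]
mixed = ["2015-11-18 15:30:00+05:30", "2015-11-18 15:30:00+06:30"]

_, tz_same = parse_with_shared_offset(same)
_, tz_mixed = parse_with_shared_offset(mixed)
```

Here `tz_same` is a fixed +05:30 offset and `tz_mixed` is `None`, mirroring the "this branch will not be hit" case.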

jbrockmendel · 2018-07-09T03:54:05Z

Looking over the test failures, most of them are clearly the-test-is-wrong.

codecov · 2018-07-12T01:32:48Z

Codecov Report

Merging #21822 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21822      +/-   ##
==========================================
+ Coverage   92.07%   92.07%   +<.01%     
==========================================
  Files         170      170              
  Lines       50690    50696       +6     
==========================================
+ Hits        46672    46678       +6     
  Misses       4018     4018

Flag	Coverage Δ
#multiple	`90.48% <100%> (ø)`	⬆️
#single	`42.3% <60%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/cast.py	`88.52% <ø> (ø)`	⬆️
pandas/core/tools/datetimes.py	`85.07% <100%> (+0.28%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d30c4a0...1cbd9b9.

jbrockmendel · 2018-07-12T02:09:13Z

pandas/_libs/tslib.pyx

+        2) datetime.datetime objects, if OutOfBoundsDatetime or TypeError
+           is encountered
+
+    Also returns a pytz.FixedOffset if an array of strings with the same


In principle other tzinfos could be returned, specifically if it falls back to dateutil

This specifies that the array_to_datetime function itself can return a pytz.FixedOffset or None from the C parser output. If it goes through the dateutil parser in the non-ValueError branch, I don't think there's a way to recover the timezone?

When parse_datetime_string is called, if there's a timezone it should return a tz-aware datetime object, so the tzinfo can be pulled off that, can't it?

Oh right that's true, good catch. Sure I will try to add some tests and functionality to hit the dateutil parser before the object branch.

So I am using a set to store the parsed timezone offsets (which should, in theory, be more performant than using an array; I was also having some trouble using an array due to duplicates). However, dateutil.tz.tzoffset objects cannot be hashed: dateutil/dateutil#792

So I am saving the total_seconds() of the dateutil tzoffset in the set instead and reconstructing the offsets as pytz.FixedOffsets.
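
A minimal sketch of the workaround described here, using only the standard library. It does not reproduce the dateutil hashing issue itself: `datetime.timezone` stands in for both `dateutil.tz.tzoffset` and `pytz.FixedOffset`, and `collect_offset_seconds` is a hypothetical name. The point is that plain numbers are always hashable, so the set deduplicates offsets regardless of which tzinfo class produced them.

```python
from datetime import datetime, timedelta, timezone


def collect_offset_seconds(datetimes):
    """Hypothetical helper: gather distinct UTC offsets as seconds."""
    seen = set()
    for dt in datetimes:
        seen.add(dt.utcoffset().total_seconds())
    return seen


plus_0530 = timezone(timedelta(hours=5, minutes=30))
values = [
    datetime(2015, 11, 18, 15, 30, tzinfo=plus_0530),
    datetime(2015, 11, 19, 15, 30, tzinfo=plus_0530),
]

seconds = collect_offset_seconds(values)
if len(seconds) == 1:
    # Rebuild one fixed-offset tzinfo from the stored number.
    rebuilt = timezone(timedelta(seconds=next(iter(seconds))))
```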

can you add Parameters here

jbrockmendel · 2018-07-12T02:09:31Z

pandas/_libs/tslib.pyx

+
+    Returns
+    -------
+    (ndarray, timezone offset)


underscore between timezone and offset?

jbrockmendel · 2018-07-12T02:11:29Z

pandas/_libs/tslib.pyx

+            #    the parsed dates to (maybe) pass to DatetimeIndex
+            # 2) If the offsets are different, then force the parsing down the
+            #    object path where an array of datetimes
+            #    (with individual datutil.tzoffsets) are returned


typo datutil

jbrockmendel · 2018-07-12T02:12:58Z

pandas/_libs/tslib.pyx

+            # Faster to compare integers than to compare objects
+            is_same_offsets = (out_tzoffset_vals[0] == out_tzoffset_vals).all()
+            if not is_same_offsets:
+                raise TypeError


This (pre-existing) pattern is pretty opaque to a first-time reader. What if instead of raising TypeError the fallback block became its own function that gets called from here?
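
One way to read this suggestion, as a generic sketch rather than the actual Cython: the first function signals "use the fallback" by raising `TypeError` and catching it later, while the second calls a named fallback directly. All function names and return values here are hypothetical.

```python
def _fast_path(values):
    # Hypothetical stand-in for the datetime64 result.
    return ("datetime64", list(values))


def _object_fallback(values):
    # Hypothetical stand-in for the object-dtype result.
    return ("object", list(values))


def convert_with_exception(values, offsets):
    # Pre-existing style: raise to jump to the fallback.
    try:
        if len(set(offsets)) > 1:
            raise TypeError
        return _fast_path(values)
    except TypeError:
        return _object_fallback(values)


def convert_with_helper(values, offsets):
    # Suggested style: the fallback is an explicit call.
    if len(set(offsets)) > 1:
        return _object_fallback(values)
    return _fast_path(values)
```

Both versions return the same results for these inputs. The difference is that the exception-based version also sends any unrelated `TypeError` raised inside the `try` block to the fallback, which is part of what makes it hard to follow.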

jbrockmendel · 2018-07-12T02:14:06Z

pandas/tests/indexes/datetimes/test_tools.py

        assert is_datetime64_ns_dtype(i)

        # tz coerceion
        result = pd.to_datetime(i, errors='coerce', cache=cache)
        tm.assert_index_equal(result, i)

        result = pd.to_datetime(i, errors='coerce', utc=True, cache=cache)
-        expected = pd.DatetimeIndex(['2000-01-01 13:00:00'],
+        expected = pd.DatetimeIndex(['2000-01-01 08:00:00'],


reason for this change?

The +00:00 in i above required a tz_convert post-construction, while I think the original intention was to tz_localize it to the passed psycopg.tz in the constructor (leading to this change).

But to reflect the intention of the original test, I just removed the +00:00 instead.
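
The difference between the two readings can be shown with the standard library. This sketch assumes a fixed offset of -05:00 for the test's timezone; that value is not stated in this thread and is chosen only because it matches the five-hour gap between 13:00 and 08:00. `replace(tzinfo=...)` plays the role of tz_localize and `astimezone(...)` the role of tz_convert.

```python
from datetime import datetime, timedelta, timezone

# Assumed offset; consistent with the 13:00 vs 08:00 difference above.
minus_5 = timezone(timedelta(hours=-5))

# Localize: treat 08:00 as wall time in -05:00 (the string's +00:00 is ignored).
localized = datetime(2000, 1, 1, 8, 0).replace(tzinfo=minus_5)

# Convert: honor +00:00 first, then express the same instant in -05:00.
converted = datetime(2000, 1, 1, 8, 0, tzinfo=timezone.utc).astimezone(minus_5)

# View both results in UTC to compare them with the test's expected values.
localized_utc = localized.astimezone(timezone.utc)
converted_utc = converted.astimezone(timezone.utc)
```

Under that assumption, the localized value corresponds to 13:00 UTC and the converted value to 08:00 UTC.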

jbrockmendel · 2018-07-12T02:17:49Z

pandas/tests/tslibs/test_array_to_datetime.py

+
+        # TODO: Appears that parsing non-ISO strings adjust the date to UTC
+        # but don't return the offset. Not sure if this is the intended
+        # behavior of non-iso strings in np_datetime_strings


np_datetime_strings doesn't handle non-ISO. That case ends up going through dateutil (via parse_datetime_string). I'm not totally sure what the TODO is for. '01-01-2013 00:00:00' comes back tz-naive, right?

jbrockmendel · 2018-07-12T02:25:02Z

pandas/_libs/tslib.pyx

+            #    (with individual datutil.tzoffsets) are returned
+
+            # Faster to compare integers than to compare objects
+            is_same_offsets = (out_tzoffset_vals[0] == out_tzoffset_vals).all()


There may be a perf tradeoff here, specifically in the case where we have all-strings, all of which are ISO, but that don't have matching timezones. Going through the parse_datetime_string path below is much slower than _string_to_dts. Going through the python path entails a big hit.

The various paths (including require_iso8601, ugh) make this a giant hassle. @jreback one way to simplify this hassle would be to strengthen the requirement on require_iso8601. ATM it raises if it sees a non-ISO string, but is fine with datetime/np.datetime64 objects. If it were strings-only, then a bunch of logic could be simplified (not necessarily in this PR). Thoughts?

jbrockmendel · 2018-07-12T02:25:43Z

A few comments, mostly about some tough perf tradeoffs. Generally this looks great; I'm really psyched to see this get fixed.

mroeschke · 2018-07-27T05:28:02Z

Addressed your comments @jreback & @jbrockmendel. Open for a final look.

jbrockmendel · 2018-07-27T20:06:32Z

pandas/_libs/tslib.pyx

+                                                   yearfirst=yearfirst)
+                pydatetime_to_dt64(oresult[i], &dts)
+                check_dts_bounds(&dts)
+            except Exception:


This could probably be more specific. parse_datetime_string could raise ValueError or OverflowError, check_dts_bounds raises OutOfBoundsDatetime (which subclasses ValueError), and I think that's it.
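
The narrowing suggested here relies on ordinary exception subclassing. In the sketch below, `OutOfBounds` is a hypothetical stand-in for pandas' OutOfBoundsDatetime, which this comment says subclasses ValueError; `parse_year`, `try_parse` and the year bounds are likewise made up for illustration.

```python
class OutOfBounds(ValueError):
    """Hypothetical stand-in for an out-of-bounds datetime error."""


def parse_year(text):
    year = int(text)  # raises ValueError on non-numeric input
    if not 1678 <= year <= 2261:  # illustrative bounds only
        raise OutOfBounds(text)
    return year


def try_parse(text):
    try:
        return parse_year(text)
    except (ValueError, OverflowError):
        # Also catches OutOfBounds because it subclasses ValueError.
        return None
```

Unlike `except Exception`, this lets unrelated errors (for example a `KeyError` from a bug elsewhere) propagate instead of being silently treated as a parse failure.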

jbrockmendel · 2018-07-27T20:08:43Z

Does this make Timestamp("now") match to_datetime("now")? Or "today"?

mroeschke · 2018-07-27T23:30:10Z

This doesn't fix "now". ("today" looks like it was handled in #19937)

jreback · 2018-07-28T13:42:43Z

can you rebase

jreback

minor comments. @jbrockmendel any comments?

jreback · 2018-07-29T15:48:49Z

doc/source/whatsnew/v0.24.0.txt

+
+*Previous Behavior*:
+
+.. code-block:: ipython


I think this directive should just be python (rather than ipython) here. @TomAugspurger?

Looking at previous whatsnews, looks like either ipython or python works for this directive.

jreback · 2018-07-29T15:51:21Z

pandas/tests/tslibs/test_array_to_datetime.py

@@ -123,11 +147,11 @@ def test_coerce_of_invalid_datetimes(self):

        # Without coercing, the presence of any invalid dates prevents
        # any values from being converted
-        result = tslib.array_to_datetime(arr, errors='ignore')
+        result = tslib.array_to_datetime(arr, errors='ignore')[0]


I like writing these like

result, _ = ......

I think it's clearer.

jreback · 2018-07-29T15:55:08Z

@mroeschke can you also run some timeseries benchmarks generally to see if there are any effects (I would expect some small slowdown for parsing, but nothing significant). You may need to add some benchmarks to capture the new code paths.

jbrockmendel · 2018-07-29T21:38:14Z

any comments?

Just that I’m pretty stoked to see this fixed.

mroeschke · 2018-07-30T00:20:07Z

From one new ASV I added, parsing strings with different offsets will be slower, but I think it's acceptable given the new (more correct) behavior.

       before           after         ratio
     [d30c4a06]       [807a2513]
+         1.17±0s       2.37±0.02s     2.02  timeseries.ToDatetimeNONISO8601.time_different_offset

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback · 2018-07-30T10:30:06Z

nice patch @mroeschke !

pganssle · 2018-08-01T14:30:36Z

pandas/_libs/tslib.pyx

+                            # is comparison fails unlike other dateutil.tz
+                            # objects. Also, dateutil.tz.tzlocal has no
+                            # _offset attribute like tzoffset
+                            offset_seconds = tz._dst_offset.total_seconds()


Note that this is using an internal method and is fragile.

Good point. Suggestion for a better way to do this?

I'm not sure I understand what this function is supposed to do, but from what I do understand it looks like a mistake.

My original intent with this section was to collect the dateutil timezone objects in a set and determine if more than 1 distinct timezone object was collected in the end.

However since dateutil timezones cannot be hashed (thanks for starting progress on the associated issue btw @pganssle), I instead store the timezone's UTC offsets in seconds instead (which is what I essentially care about).

It wasn't apparent to me whether there was a public way to access a dateutil timezone's UTC offset, which is why I am using a private attribute here. Definitely open to a better method.

I think if you can't find a public method you probably shouldn't dive into the private API. There are a few assumptions in here that won't survive future versions of dateutil for sure. The tzlocal thing also actually gives you the wrong answer, because tzlocal is not a fixed offset and you're taking the DST offset.

If you want the offset, you should do dt.utcoffset(), regardless of whether it's dateutil or not.

If you are significantly worried about speed, you can probably use the fact that in dateutil >= 2.7.0, dateutil.tz.tzutc() returns a singleton, and dateutil.tz.tzoffset(*args1) is the same object as any other dateutil.tz.tzoffset(*args2) where args1 == args2. So you can probably store a mapping between id(obj) and the result of obj.utcoffset(datetime(1970, 1, 1)) for that specific subclass (since it is also guaranteed to have no DST).
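
A sketch of the public-API approach suggested above, using the standard library in place of dateutil objects. `offset_seconds` and the toy `FakeDST` zone are hypothetical. The helper asks the datetime for its offset through `utcoffset()`, which is part of the `tzinfo` interface and so works for any tzinfo implementation; the toy zone shows why the DST component alone is not the UTC offset.

```python
from datetime import datetime, timedelta, timezone, tzinfo


class FakeDST(tzinfo):
    """Toy zone: standard offset +01:00 plus one hour of DST, always on."""

    def utcoffset(self, dt):
        return timedelta(hours=1) + self.dst(dt)

    def dst(self, dt):
        return timedelta(hours=1)

    def tzname(self, dt):
        return "FAKE"


def offset_seconds(dt):
    """Hypothetical helper: total UTC offset via the public tzinfo API."""
    return dt.utcoffset().total_seconds()


fixed = datetime(2018, 8, 1, 12, tzinfo=timezone(timedelta(hours=5, minutes=30)))
summer = datetime(2018, 8, 1, 12, tzinfo=FakeDST())

fixed_seconds = offset_seconds(fixed)
summer_seconds = offset_seconds(summer)
dst_only_seconds = summer.dst().total_seconds()
```

For the toy zone, the full offset is two hours while the DST part is one hour, so reading only the DST offset would give the wrong answer.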

Matt Roeschke added 8 commits July 7, 2018 18:07

BUG: to_datetime no longer converts offsets to UTC

ac5a3d1

Merge remote-tracking branch 'upstream/master' into parse_tz_offsets

b81a8e9

Document and now return offset

6bf46a8

Add some tests, start converting some existing uses of array_to_datetime

678b337

Add more tests

1917148

Adjust test

581a33e

Flake8

a1bc8f9

Add tests confirming new behavior

80042e6

mroeschke added the Timezones label Jul 8, 2018

jbrockmendel reviewed Jul 9, 2018

Matt Roeschke added 5 commits July 9, 2018 23:12

Merge remote-tracking branch 'upstream/master' into parse_tz_offsets

7c4135e

Merge remote-tracking branch 'upstream/master' into parse_tz_offsets

0651416

Lint

bacb6e3

adjust a test

a2f4aad

Ensure box object index, pass tests

d48f341

mroeschke mentioned this pull request Jul 11, 2018

API/BUG: to_datetime(..., box=True) should always return an Index #21864

Closed

Matt Roeschke added 3 commits July 11, 2018 12:54

Adjust tests

7efb25c

Adjust test

1d527ff

Cleanup and add comments

f89d6b6

jbrockmendel reviewed Jul 12, 2018

Matt Roeschke added 6 commits July 25, 2018 10:22

Address review

8463d91

Address tzlocal

dddc6b3

Change is to == for older dateutil compat

cca3983

Merge remote-tracking branch 'upstream/master' into parse_tz_offsets

e441be0

Modify example in whatsnew to display

a8a65f7

Merge remote-tracking branch 'upstream/master' into parse_tz_offsets

75f9fd9

jbrockmendel reviewed Jul 27, 2018

Add more specific errors

6052475

Merge remote-tracking branch 'upstream/master' into parse_tz_offsets

f916c69

jreback approved these changes Jul 29, 2018

jreback added this to the 0.24.0 milestone Jul 29, 2018

Merge remote-tracking branch 'upstream/master' into parse_tz_offsets

807a251

Add some benchmarks and reformat tests

1cbd9b9

jreback merged commit 9a8cebc into pandas-dev:master Jul 30, 2018

mroeschke deleted the parse_tz_offsets branch July 30, 2018 15:22

pganssle reviewed Aug 1, 2018

mroeschke mentioned this pull request Aug 2, 2018

CLN: Use public method to capture UTC offsets #22164

Merged

dberenbaum pushed a commit to dberenbaum/pandas that referenced this pull request Aug 3, 2018

BUG/API: to_datetime preserves UTC offsets when parsing datetime stri…

00f6696

…ngs (pandas-dev#21822)

This was referenced Aug 21, 2018

BUG: to_datetime drops UTC offset when parsing datetime strings and box=False #22446

Closed

BUG: Retain timezone information in to_datetime if box=False #22457

Merged

mroeschke mentioned this pull request Aug 29, 2018

BUG: invalid construction from repr of dt-aware index #15938

Closed

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG/API: to_datetime preserves UTC offsets when parsing datetime stri…

ae1d067

…ngs (pandas-dev#21822)
