Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: to_datetime preserves UTC offsets when parsing datetime strings #21822

Merged
merged 43 commits into from
Jul 30, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
ac5a3d1
BUG: to_datetime no longer converts offsets to UTC
Jul 7, 2018
b81a8e9
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 8, 2018
6bf46a8
Document and now return offset
Jul 8, 2018
678b337
Add some tests, start converting some existing uses of array_to_datetime
Jul 8, 2018
1917148
Add more tests
Jul 8, 2018
581a33e
Adjust test
Jul 8, 2018
a1bc8f9
Flake8
Jul 8, 2018
80042e6
Add tests confirming new behavior
Jul 8, 2018
7c4135e
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 10, 2018
0651416
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 11, 2018
bacb6e3
Lint
Jul 11, 2018
a2f4aad
adjust a test
Jul 11, 2018
d48f341
Ensure box object index, pass tests
Jul 11, 2018
7efb25c
Adjust tests
Jul 11, 2018
1d527ff
Adjust test
Jul 11, 2018
f89d6b6
Cleanup and add comments
Jul 12, 2018
d91c63f
address comments
Jul 12, 2018
1054e8b
adjust test to be closer to the original behavior
Jul 12, 2018
d9cdc91
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 12, 2018
7d04613
Add TypeError clause
Jul 12, 2018
031284c
Add TypeError not ValueError
Jul 12, 2018
749e62e
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 12, 2018
23cbf75
fix typo
Jul 12, 2018
1e6f87a
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 18, 2018
7539bcf
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 19, 2018
c1f51cd
New implimentation
Jul 19, 2018
db75a24
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 20, 2018
4733ac5
Change implimentation and add some tests
Jul 20, 2018
e3db735
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 20, 2018
2fa681f
Add missing commas
Jul 20, 2018
5f36c98
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 24, 2018
d7ff275
Change implimentation since tzoffsets cannot be hashed
Jul 25, 2018
4ff7cb3
Add whatsnew
Jul 25, 2018
8463d91
Address review
Jul 25, 2018
dddc6b3
Address tzlocal
Jul 25, 2018
cca3983
Change is to == for older dateutil compat
Jul 26, 2018
e441be0
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 26, 2018
a8a65f7
Modify example in whatsnew to display
Jul 26, 2018
75f9fd9
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 26, 2018
6052475
Add more specific errors
Jul 27, 2018
f916c69
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 28, 2018
807a251
Merge remote-tracking branch 'upstream/master' into parse_tz_offsets
Jul 29, 2018
1cbd9b9
Add some benchmarks and reformat tests
Jul 30, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions asv_bench/benchmarks/timeseries.py
Original file line number Diff line number Diff line change
Expand Up @@ -343,6 +343,25 @@ def time_iso8601_tz_spaceformat(self):
to_datetime(self.strings_tz_space)


class ToDatetimeNONISO8601(object):

goal_time = 0.2

def setup(self):
N = 10000
half = int(N / 2)
ts_string_1 = 'March 1, 2018 12:00:00+0400'
ts_string_2 = 'March 1, 2018 12:00:00+0500'
self.same_offset = [ts_string_1] * N
self.diff_offset = [ts_string_1] * half + [ts_string_2] * half

def time_same_offset(self):
to_datetime(self.same_offset)

def time_different_offset(self):
to_datetime(self.diff_offset)


class ToDatetimeFormat(object):

goal_time = 0.2
Expand Down
57 changes: 57 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -224,6 +224,62 @@ For situations where you need an ``ndarray`` of ``Interval`` objects, use
np.asarray(idx)
idx.values.astype(object)

.. _whatsnew_0240.api.timezone_offset_parsing:

Parsing Datetime Strings with Timezone Offsets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, parsing datetime strings with UTC offsets with :func:`to_datetime`
or :class:`DatetimeIndex` would automatically convert the datetime to UTC
without timezone localization. This is inconsistent from parsing the same
datetime string with :class:`Timestamp` which would preserve the UTC
offset in the ``tz`` attribute. Now, :func:`to_datetime` preserves the UTC
offset in the ``tz`` attribute when all the datetime strings have the same
UTC offset (:issue:`17697`, :issue:`11736`)

*Previous Behavior*:

.. code-block:: ipython
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this directiive you just be python (rather than ipython here) @TomAugspurger ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at previous whatsnews, looks like either ipython or python works for this directive.



In [2]: pd.to_datetime("2015-11-18 15:30:00+05:30")
Out[2]: Timestamp('2015-11-18 10:00:00')

In [3]: pd.Timestamp("2015-11-18 15:30:00+05:30")
Out[3]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')

# Different UTC offsets would automatically convert the datetimes to UTC (without a UTC timezone)
In [4]: pd.to_datetime(["2015-11-18 15:30:00+05:30", "2015-11-18 16:30:00+06:30"])
Out[4]: DatetimeIndex(['2015-11-18 10:00:00', '2015-11-18 10:00:00'], dtype='datetime64[ns]', freq=None)

*Current Behavior*:

.. ipython:: python

pd.to_datetime("2015-11-18 15:30:00+05:30")
pd.Timestamp("2015-11-18 15:30:00+05:30")

Parsing datetime strings with the same UTC offset will preserve the UTC offset in the ``tz``

.. ipython:: python

pd.to_datetime(["2015-11-18 15:30:00+05:30"] * 2)

Parsing datetime strings with different UTC offsets will now create an Index of
``datetime.datetime`` objects with different UTC offsets

.. ipython:: python

idx = pd.to_datetime(["2015-11-18 15:30:00+05:30", "2015-11-18 16:30:00+06:30"])
idx
idx[0]
idx[1]

Passing ``utc=True`` will mimic the previous behavior but will correctly indicate
that the dates have been converted to UTC

.. ipython:: python
pd.to_datetime(["2015-11-18 15:30:00+05:30", "2015-11-18 16:30:00+06:30"], utc=True)

.. _whatsnew_0240.api.datetimelike.normalize:

Expand Down Expand Up @@ -439,6 +495,7 @@ Datetimelike

- Fixed bug where two :class:`DateOffset` objects with different ``normalize`` attributes could evaluate as equal (:issue:`21404`)
- Fixed bug where :meth:`Timestamp.resolution` incorrectly returned 1-microsecond ``timedelta`` instead of 1-nanosecond :class:`Timedelta` (:issue:`21336`,:issue:`21365`)
- Bug in :func:`to_datetime` that did not consistently return an :class:`Index` when ``box=True`` was specified (:issue:`21864`)

Timedelta
^^^^^^^^^
Expand Down
174 changes: 145 additions & 29 deletions pandas/_libs/tslib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import numpy as np
cnp.import_array()

import pytz
from dateutil.tz import tzlocal, tzutc as dateutil_utc


from util cimport (is_integer_object, is_float_object, is_string_object,
Expand Down Expand Up @@ -328,7 +329,7 @@ cpdef array_with_unit_to_datetime(ndarray values, unit, errors='coerce'):
if unit == 'ns':
if issubclass(values.dtype.type, np.integer):
return values.astype('M8[ns]')
return array_to_datetime(values.astype(object), errors=errors)
return array_to_datetime(values.astype(object), errors=errors)[0]

m = cast_from_unit(None, unit)

Expand Down Expand Up @@ -457,21 +458,58 @@ cpdef array_to_datetime(ndarray[object] values, errors='raise',
dayfirst=False, yearfirst=False,
format=None, utc=None,
require_iso8601=False):
"""
Converts a 1D array of date-like values to a numpy array of either:
1) datetime64[ns] data
2) datetime.datetime objects, if OutOfBoundsDatetime or TypeError
is encountered

Also returns a pytz.FixedOffset if an array of strings with the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle other tzinfos could be returned, specifically if it falls back to dateutil

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specifying that array_to_datetime function itself can return a pytz.FixedOffset or None from the C parser output. If it goes through the dateutil parser in the non-ValueError branch, I don't think there's a way to recover the timezone?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parse_datetime_string is called if there's a timezone it should return a tz-aware datetime object, so the tzinfo can be pulled off that can't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right that's true, good catch. Sure I will try to add some tests and functionality to hit the dateutil parser before the object branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am using a set to store the parsed timezone offsets (which should be more performant that using an array in theory / I was having some trouble using an array due to duplicates), however dateutil.tz.tzoffsets cannot be hashed: dateutil/dateutil#792

So instead, I am saving the total_seconds() of the dateutil tzoffset in the set instead and reconstructing the offsets as pytz.FixedOffsets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add Parameters here

timezone offset is passed and utc=True is not passed. Otherwise, None
is returned

Handles datetime.date, datetime.datetime, np.datetime64 objects, numeric,
strings

Parameters
----------
values : ndarray of object
date-like objects to convert
errors : str, default 'raise'
error behavior when parsing
dayfirst : bool, default False
dayfirst parsing behavior when encountering datetime strings
yearfirst : bool, default False
yearfirst parsing behavior when encountering datetime strings
format : str, default None
format of the string to parse
utc : bool, default None
indicator whether the dates should be UTC
require_iso8601 : bool, default False
indicator whether the datetime string should be iso8601

Returns
-------
tuple (ndarray, tzoffset)
"""
cdef:
Py_ssize_t i, n = len(values)
object val, py_dt
object val, py_dt, tz, tz_out = None
ndarray[int64_t] iresult
ndarray[object] oresult
npy_datetimestruct dts
bint utc_convert = bool(utc)
bint seen_integer = 0
bint seen_string = 0
bint seen_datetime = 0
bint seen_datetime_offset = 0
bint is_raise = errors=='raise'
bint is_ignore = errors=='ignore'
bint is_coerce = errors=='coerce'
_TSObject _ts
int out_local=0, out_tzoffset=0
float offset_seconds
set out_tzoffset_vals = set()

# specify error conditions
assert is_raise or is_ignore or is_coerce
Expand Down Expand Up @@ -584,7 +622,7 @@ cpdef array_to_datetime(ndarray[object] values, errors='raise',
raise ValueError("time data {val} doesn't match "
"format specified"
.format(val=val))
return values
return values, tz_out

try:
py_dt = parse_datetime_string(val, dayfirst=dayfirst,
Expand All @@ -595,6 +633,30 @@ cpdef array_to_datetime(ndarray[object] values, errors='raise',
continue
raise TypeError("invalid string coercion to datetime")

# If the dateutil parser returned tzinfo, capture it
# to check if all arguments have the same tzinfo
tz = py_dt.tzinfo
if tz is not None:
seen_datetime_offset = 1
if tz == dateutil_utc():
# dateutil.tz.tzutc has no offset-like attribute
# Just add the 0 offset explicitly
out_tzoffset_vals.add(0)
elif tz == tzlocal():
# is comparison fails unlike other dateutil.tz
# objects. Also, dateutil.tz.tzlocal has no
# _offset attribute like tzoffset
offset_seconds = tz._dst_offset.total_seconds()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is using an internal method and is fragile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Suggestion for a better way to do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this function is supposed to do, but from what I do understand it looks like a mistake.

Copy link
Member Author

@mroeschke mroeschke Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original intent with this section was to collect the dateutil timezone objects in a set and determine if more than 1 distinct timezone object was collected in the end.

However since dateutil timezones cannot be hashed (thanks for starting progress on the associated issue btw @pganssle), I instead store the timezone's UTC offsets in seconds instead (which is what I essentially care about).

It wasn't too apparent to me if there was a public way to access a dateutil timezone's UTC offset, hence why I am use a private method here. Definitely open to a better method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you can't find a public method you probably shouldn't dive into the private API. There are a few assumptions in here that won't survive future versions of dateutil for sure. The tzlocal thing also actually gives you the wrong answer, because tzlocal is not a fixed offset and you're taking the DST offset.

If you want the offset, you should do dt.utcoffset(), regardless of whether it's dateutil or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are significantly worried about speed, you can probably use the fact that in dateutil >= 2.7.0, dateutil.tz.tzutc() returns a singleton, and dateutil.tz.tzoffset(*args1) is the same object as any other dateutil.tz.tzoffset(*args2) where args1 == args2. So you can probably store a mapping between id(obj) and the result of obj.tzoffset(datetime(1970, 1, 1)) for that specific subclass (since it is also guaranteed to have no DST).

out_tzoffset_vals.add(offset_seconds)
else:
# dateutil.tz.tzoffset objects cannot be hashed
# store the total_seconds() instead
offset_seconds = tz._offset.total_seconds()
out_tzoffset_vals.add(offset_seconds)
else:
# Add a marker for naive string, to track if we are
# parsing mixed naive and aware strings
out_tzoffset_vals.add('naive')
try:
_ts = convert_datetime_to_tsobject(py_dt, None)
iresult[i] = _ts.value
Expand All @@ -614,8 +676,17 @@ cpdef array_to_datetime(ndarray[object] values, errors='raise',
# where we left off
value = dtstruct_to_dt64(&dts)
if out_local == 1:
seen_datetime_offset = 1
# Store the out_tzoffset in seconds
# since we store the total_seconds of
# dateutil.tz.tzoffset objects
out_tzoffset_vals.add(out_tzoffset * 60.)
tz = pytz.FixedOffset(out_tzoffset)
value = tz_convert_single(value, tz, 'UTC')
else:
# Add a marker for naive string, to track if we are
# parsing mixed naive and aware strings
out_tzoffset_vals.add('naive')
iresult[i] = value
try:
check_dts_bounds(&dts)
Expand All @@ -631,7 +702,7 @@ cpdef array_to_datetime(ndarray[object] values, errors='raise',
raise ValueError("time data {val} doesn't "
"match format specified"
.format(val=val))
return values
return values, tz_out
raise

else:
Expand All @@ -657,7 +728,21 @@ cpdef array_to_datetime(ndarray[object] values, errors='raise',
else:
raise TypeError

return result
if seen_datetime_offset and not utc_convert:
# GH 17697
# 1) If all the offsets are equal, return one offset for
# the parsed dates to (maybe) pass to DatetimeIndex
# 2) If the offsets are different, then force the parsing down the
# object path where an array of datetimes
# (with individual dateutil.tzoffsets) are returned
is_same_offsets = len(out_tzoffset_vals) == 1
if not is_same_offsets:
return array_to_datetime_object(values, is_raise,
dayfirst, yearfirst)
else:
tz_offset = out_tzoffset_vals.pop()
tz_out = pytz.FixedOffset(tz_offset / 60.)
return result, tz_out
except OutOfBoundsDatetime:
if is_raise:
raise
Expand All @@ -679,36 +764,67 @@ cpdef array_to_datetime(ndarray[object] values, errors='raise',
oresult[i] = val.item()
else:
oresult[i] = val
return oresult
return oresult, tz_out
except TypeError:
oresult = np.empty(n, dtype=object)
return array_to_datetime_object(values, is_raise, dayfirst, yearfirst)

for i in range(n):
val = values[i]
if checknull_with_nat(val):
oresult[i] = val
elif is_string_object(val):

if len(val) == 0 or val in nat_strings:
oresult[i] = 'NaT'
continue
cdef array_to_datetime_object(ndarray[object] values, bint is_raise,
dayfirst=False, yearfirst=False):
"""
Fall back function for array_to_datetime

try:
oresult[i] = parse_datetime_string(val, dayfirst=dayfirst,
yearfirst=yearfirst)
pydatetime_to_dt64(oresult[i], &dts)
check_dts_bounds(&dts)
except Exception:
if is_raise:
raise
return values
# oresult[i] = val
else:
Attempts to parse datetime strings with dateutil to return an array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add Parameters

of datetime objects

Parameters
----------
values : ndarray of object
date-like objects to convert
is_raise : bool
error behavior when parsing
dayfirst : bool, default False
dayfirst parsing behavior when encountering datetime strings
yearfirst : bool, default False
yearfirst parsing behavior when encountering datetime strings

Returns
-------
tuple (ndarray, None)
"""
cdef:
Py_ssize_t i, n = len(values)
object val,
ndarray[object] oresult
npy_datetimestruct dts

oresult = np.empty(n, dtype=object)

# We return an object array and only attempt to parse:
# 1) NaT or NaT-like values
# 2) datetime strings, which we return as datetime.datetime
for i in range(n):
val = values[i]
if checknull_with_nat(val):
oresult[i] = val
elif is_string_object(val):
if len(val) == 0 or val in nat_strings:
oresult[i] = 'NaT'
continue
try:
oresult[i] = parse_datetime_string(val, dayfirst=dayfirst,
yearfirst=yearfirst)
pydatetime_to_dt64(oresult[i], &dts)
check_dts_bounds(&dts)
except (ValueError, OverflowError):
if is_raise:
raise
return values

return oresult
return values, None
else:
if is_raise:
raise
return values, None
return oresult, None


cdef inline bint _parse_today_now(str val, int64_t* iresult):
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/dtypes/cast.py
Original file line number Diff line number Diff line change
Expand Up @@ -918,7 +918,7 @@ def try_datetime(v):
# GH19671
v = tslib.array_to_datetime(v,
require_iso8601=True,
errors='raise')
errors='raise')[0]
except ValueError:

# we might have a sequence of the same-datetimes with tz's
Expand Down
Loading