BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #42068

i-aki-y · 2021-06-17T11:07:52Z

closes BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #41047
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

In this PR, I try to fix issue #41047.
I would appreciate it if you could review this PR.

About the issue

In the UTC format, "Z" is used as a special timezone "+00:00".
But current pd.to_datetime function can not recognize the last "Z" when the infer_datetime_format is True.
Note that pd.to_datetime can properly interpret "Z" as "+00:00" when the infer_datetime_format is unspecified.

The problem happens in the following codes of datetimes.py:

    if infer_datetime_format and format is None:
        format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)

    if format is not None:
        # There is a special fast-path for iso8601 formatted
        # datetime strings, so in those cases don't use the inferred
        # format because this path makes process slower in this
        # special case
        format_is_iso8601 = format_is_iso(format)
        if format_is_iso8601:
            require_iso8601 = not infer_datetime_format
            format = None

    if format is not None:
        res = _to_datetime_with_format(
            arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
        )
        if res is not None:
            return res

Here, the _guess_datetime_format_for_array returns a format string like '%Y-%m-%d %H:%M:%SZ' depending on the input strings.
The resulting format is estimated by format_is_iso function.
This result is used to judge whether the function specialized for iso format is used or not.
But current format_is_iso can not recognize the last "Z" in a format string, so it returns False if the format has "Z" regardless the format is valid iso8601 format or not.
Because of this misjudgment, the wrong function (_to_datetime_with_format) is used for the iso string, and the timestamp that does not have timezone information is returned.

About the fix

I fixed the format_is_iso function to recognize the last "Z" in the format string to solve this problem.

mroeschke · 2021-06-17T16:51:57Z

pandas/tests/tslibs/test_parsing.py

+        ("%Y.%m.%d %H:%M:%S", True),
+        ("%Y%m%d %H:%M:%S", True),
+        ("%Y-%m-%dT%H:%M:%S", True),
+        ("%Y-%m-%dT%H:%M:%SZ", True),


This doesn't appear like a valid strftime directive. To parse Z, I think %z needs to be used e.g.

In [9]: datetime.datetime.strptime("2020-05-01T18:04:03Z", "%Y-%m-%dT%H:%M:%S%z") Out[9]: datetime.datetime(2020, 5, 1, 18, 4, 3, tzinfo=datetime.timezone.utc)

Thank you for your comment.
My fix was done to fit the output of guess_datetime_format.

ex.

In [1]: from pandas._libs.tslibs.parsing import guess_datetime_format ...: guess_datetime_format('2020-05-01T18:04:03Z') Out[1]: '%Y-%m-%dT%H:%M:%SZ'

But as you said, the format string like "%SZ" should be "%S%z"
So now I guess the guess_datetime_format function also has some bugs.

mroeschke · 2021-06-17T16:52:47Z

pandas/_libs/tslibs/parsing.pyx

@@ -825,6 +825,10 @@ def format_is_iso(f: str) -> bint:
    iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f'.format
    excluded_formats = ['%Y%m%d', '%Y%m', '%Y']

+    if (f is not None) and (f[-2:] in ["SZ", "fZ"]):


As mentioned in my other comment, shouldn't %z be checked in the iso_template?

I agree, I will revise my PR.

i-aki-y · 2021-06-19T03:48:01Z

@mroeschke I revised my PR based on the review comments.
Now format_is_iso function recognize "%z" and "%Z" format instead of "Z".
To fix the problem, I improved the guess_datetime_format to recognize some additional timezone patterns (see below).

ex.

print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03UTC'))
print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03Z'))
print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03+0000'))

# Before the fix
# %Y-%m-%dT%H:%M:%S%Z
# %Y-%m-%dT%H:%M:%SZ
# %Y-%m-%dT%H:%M:%S+%f

# After the fix
# %Y-%m-%dT%H:%M:%S%Z
# %Y-%m-%dT%H:%M:%S%z
# %Y-%m-%dT%H:%M:%S%z

In the fix of guess_datetime_format, I have tried to make the fix to be simple but there may be some tricky parts.
If you have any suggestions on how to implement it better, I would appreciate your advice.

mroeschke · 2021-06-21T16:16:51Z

pandas/_libs/tslibs/parsing.pyx

@@ -825,6 +825,10 @@ def format_is_iso(f: str) -> bint:
    iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f'.format
    excluded_formats = ['%Y%m%d', '%Y%m', '%Y']

+    if (f is not None) and (f[-3:] in ["S%z", "S%Z", "f%z", "f%Z"]):


Instead could we just loop over the possible timezone directives in the section below?
e.g.

iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f{tz_dir}'.format ... for time_sep in ...: for tz_dir in ["", "%z", "%Z"]: ...

Thank you for your comment.
Just to confirm, a loop version will be the following codes.
(I checked this passes the test)

iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S{micro_or_tz}'.format excluded_formats = ['%Y%m%d', '%Y%m', '%Y'] for date_sep in [' ', '/', '\\', '-', '.', '']: for time_sep in [' ', 'T']: for micro_or_tz in ['', '%z', '%Z', '.%f', '.%f%z', '.%f%Z']: if (iso_template(date_sep=date_sep, time_sep=time_sep, micro_or_tz=micro_or_tz, ).startswith(f) and f not in excluded_formats):

My concern was that this increase a nest a little.
But this version might be better for simplicity.

Yes that's the loop I'm talking about. I think this version is more explicit rather than checking for the presence of %z/Z and then removing

pandas/_libs/tslibs/parsing.pyx

i-aki-y · 2021-06-23T05:01:03Z

@mroeschke Now the guess_datetime_format can parse “+hh:mm” type timezone offset.
For consistency, I added some patterns that may not be commonly used to make the function recognize timezone formats that can be parsed correctly by the default parser.
The following examples demonstrate parser results and inferred formats.

import pandas as pd
from dateutil.parser import parse as du_parse

def try_parse(dt_str):
    try:
        print(f"{dt_str:<26}", "->", du_parse(dt_str), "->", pd._libs.tslibs.parsing.guess_datetime_format(dt_str))
    except Exception as e:
        print(type(e), e)
        print("->", pd._libs.tslibs.parsing.guess_datetime_format(dt_str))

try_parse("2011-12-30T00:00:00+9")
try_parse("2011-12-30T00:00:00+09")
try_parse("2011-12-30T00:00:00+0905")
try_parse("2011-12-30T00:00:00+9:5")
try_parse("2011-12-30T00:00:00+09:05")
try_parse("2011-12-30T00:00:00+9:005")
try_parse("2011-12-30T00:00:00+009:00")
try_parse("2011-12-30T00:00:00+090")
try_parse("2011-12-30T00:00:00+09:")

# Outputs:
#
# 2011-12-30T00:00:00+9      -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+09     -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+0905   -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+9:5    -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+09:05  -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+9:005  -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+009:00 -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# <class 'dateutil.parser._parser.ParserError'> Unknown string format: 2011-12-30T00:00:00+090
# -> None
# <class 'dateutil.parser._parser.ParserError'> Unknown string format: 2011-12-30T00:00:00+09:
# -> None

I would appreciate it if you could review this.

mroeschke · 2021-06-23T17:22:10Z

pandas/_libs/tslibs/parsing.pyx

+    # joined tokens, which is carried out at the final step of the function,
+    # the offset part of the tokens must match the '%z' format like '+0900'
+    # instead of ‘+09:00’.
+    if (parsed_datetime.tzinfo is not None):


Nit: Don't need the parenthesis around this if statement (and most of the other ones you have here).

Thanks, redundant parentheses are removed.

mroeschke · 2021-06-23T17:24:31Z

pandas/_libs/tslibs/parsing.pyx

+            if (
+                (len(tokens) > 3)
+                and tokens[-1].isdigit()
+                and (tokens[-2] == ':')


To be safe, I think this should be any non-digit separator just in case some format using something different than a colon

I skipped the check about the separator string. So no assumptions are made about the separator type.
If you think that the not tokens[-2].isdigit() is better, please let me know.

mroeschke

Looks mostly good. Just a few comments

i-aki-y · 2021-06-24T07:13:53Z

@mroeschke Thank you for your advice. I revised the PR. Any comments are welcome.

mroeschke

Awesome work. LGTM

pandas/_libs/tslibs/parsing.pyx

i-aki-y · 2021-06-25T13:32:51Z

@mroeschke @jreback Thank you for your reviews.

i appreciate that we have a bug, but this is WAY more complicated than before. can you simplify this at all

@jreback As you say, the current fix is complex rather than what I thought first.

Now I have no idea to simplify the fix of the "guess_datetime_format" without any downgrade of the functionality.

But the fix might be overkill for the original bug.
I think that the minimal fix for the function for this bug is returning None for the timezone offset.

Ex.

    if parsed_datetime.tzinfo is not None:
        if (
            (len(tokens) > 0 and tokens[-1] == "Z")
            or (len(tokens) > 1 and tokens[-2] in ("+", "-"))
        ):
            return None

The only user of the guess_datetime_format is the to_datetime function.
In the to_datetime function, if the guess_datetime_format correctly fails to parse and returns None, the input string is passed to the object_to_datetime64ns function. And then, the object_to_datetime64ns function can parse timezone.

Bugs will happen when the guess_datetime_format incorrectly parses and returns the wrong format.
So the strategy of returning None when the input has a timezone offset also works.

(In the above code, I excluded the condition or (len(tokens) > 3 and tokens[-4] in ("+", "-")) because this pattern correctly fails in the subsequent process.)

Is this acceptable as the alternative solution?

i-aki-y · 2021-06-30T09:01:31Z

@mroeschke @jreback I found a more straightforward way to handle timezone offset.
I would appreciate it if you could review this revision.

mroeschke

Nice simplification. Just one question.

pandas/_libs/tslibs/parsing.pyx

github-actions · 2021-08-01T00:03:00Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

mroeschke · 2021-08-01T04:22:39Z

doc/source/whatsnew/v1.3.0.rst

@@ -932,6 +932,7 @@ Timezones
 ^^^^^^^^^
 - Bug in different ``tzinfo`` objects representing UTC not being treated as equivalent (:issue:`39216`)
 - Bug in ``dateutil.tz.gettz("UTC")`` not being recognized as equivalent to other UTC-representing tzinfos (:issue:`39276`)
+- Bug in :func:`to_datetime` with ``infer_datetime_format=True`` failing to parse zero UTC offset (``Z``) correctly (:issue:`41047`)


Sorry for the delay. This PR didn't make it into 1.3. Could you move this to 1.4.0 whatsnew?

@mroeschke OK!
Should I merge the current master and then fix the whatsnew file?

Yes that would be great, thank you.

Fixed whatsnew entry.

github-actions · 2021-09-09T00:03:17Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2021-09-10T14:03:18Z

@i-aki-y can you merge master.

also can you run the to_datetime benchamarks to see if any change here

i-aki-y · 2021-09-13T04:40:37Z

Hi, @jreback @mroeschke I merged the current master branch.

also can you run the to_datetime benchamarks to see if any change here

OK, but what benchmark should I do?
If you say about the benchmark of asv described at the following docs,

https://pandas.pydata.org/docs/dev/development/contributing_codebase.html?highlight=bench#running-the-performance-test-suite

I have run the asv command for the inference.py because the asv_bench/inference.py has some to_datetime benchmarks.

$ asv continuous -f 1.1 master fix-infer_datetime_format-drop-timezone -b inference

And I got the following results summary:

        before           after         ratio
     [02fdbde3]       [267b971f]
     <master>         <fix-infer_datetime_format-drop-timezone>
-        379±20μs         331±10μs     0.87  inference.ToDatetimeCacheSmallCount.time_unique_date_strings(False, 500)
-      2.97±0.2ms      2.58±0.03ms     0.87  inference.ToNumeric.time_from_numeric_str('coerce')
-     1.36±0.07ms      1.06±0.02ms     0.78  inference.ToNumericDowncast.time_downcast('datetime64', 'float')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Although some differences are reported, these are seem to be unrelated to the fix.

I also wrote the following benchmark code specific for the fixes.

import pandas as pd
rng = pd.date_range(start="1/1/2000", periods=20000, freq="H")
strings_tz_space = [
    x.strftime("%Y-%m-%d %H:%M:%S") + "Z" for x in rng
]
fmts = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S%Z",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S.%f%Z",
]

## to_datetime
%timeit pd.to_datetime(strings_tz_space)
%timeit pd.to_datetime(strings_tz_space, infer_datetime_format=True)

## For functions used internally.
%timeit [pd._libs.tslibs.parsing.guess_datetime_format(ts) for ts in strings_tz_space[:500]]
%timeit [pd._libs.tslibs.parsing.format_is_iso(fmt) for fmt in fmts]

The results are here:

%timeit pd.to_datetime(strings_tz_space)
# master: 7.78 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# PR    : 7.51 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ratio : 0.96

%timeit pd.to_datetime(strings_tz_space, infer_datetime_format=True)
# master: 60.1 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# PR    : 8.77 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ratio : 0.14

%timeit [pd._libs.tslibs.parsing.guess_datetime_format(ts) for ts in strings_tz_space[:500]]
# master: 174 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# PR    : 171 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# ratio : 0.98

%timeit [pd._libs.tslibs.parsing.format_is_iso(fmt) for fmt in fmts]
# master: 59.1 µs ± 991 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# PR    : 298 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ratio : 5.04

The PR version of the pd.to_datetime(..., infer_datetime_format=True) looks much faster than the master version.
This seems because the fix of format inference makes select the optimal parse function internally.
The format_is_iso 5x slow down, but I think this impact will be limited because this still be fast and the function called once for each input list, not per list item.

If any other investigation is needed, let me know.

jreback · 2021-09-13T12:47:12Z

@i-aki-y can you add the additional cases to the existing asv's (e.g. the ones you list here #42068 (comment)); we care about the to_datetime ones rather than the internal functions themselves.

ping on green.

pep8speaks · 2021-09-14T04:35:39Z

Hello @i-aki-y! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-14 04:50:23 UTC

i-aki-y · 2021-09-14T06:36:13Z

@jreback I added a benchmark for to_datetime(..., infer_datetime_format=True) to the asv_bench/inference.py.

jreback · 2021-09-14T12:35:25Z

thanks @i-aki-y very nice! keep em coming!

i-aki-y added 2 commits June 17, 2021 18:40

fix to_datetime with infer_datetime_format drop tz

2b6e9f6

Fix fix reference issue number

09a5108

mroeschke reviewed Jun 17, 2021

View reviewed changes

jreback added Bug Timezones Timezone data dtype labels Jun 18, 2021

make guess_datetime_format parse more timezon pattern

1886b22

mroeschke reviewed Jun 21, 2021

View reviewed changes

pandas/_libs/tslibs/parsing.pyx Show resolved Hide resolved

make guess_datetime_format can parse offset with separator

309b5d9

mroeschke reviewed Jun 23, 2021

View reviewed changes

remove redundant parentheses and separator check

7198d51

mroeschke added this to the 1.3 milestone Jun 24, 2021

mroeschke approved these changes Jun 24, 2021

View reviewed changes

jreback requested changes Jun 25, 2021

View reviewed changes

pandas/_libs/tslibs/parsing.pyx Show resolved Hide resolved

simonjayhawkins mentioned this pull request Jun 25, 2021

RLS: 1.3 #40169

Closed

jreback modified the milestones: 1.3, 1.3.1, 1.4 Jun 25, 2021

Simplify timezone offset handling.

f42be84

mroeschke reviewed Jul 1, 2021

View reviewed changes

pandas/_libs/tslibs/parsing.pyx Show resolved Hide resolved

mroeschke approved these changes Jul 1, 2021

View reviewed changes

github-actions bot added the Stale label Aug 1, 2021

mroeschke reviewed Aug 1, 2021

View reviewed changes

mroeschke removed the Stale label Aug 6, 2021

i-aki-y added 2 commits August 7, 2021 09:10

Merge branch 'master' into fix-infer_datetime_format-drop-timezone

44dbf67

fix whatsnew entry

a031e39

mroeschke approved these changes Aug 7, 2021

View reviewed changes

github-actions bot added the Stale label Sep 9, 2021

lithomas1 requested a review from jreback September 9, 2021 22:28

lithomas1 added Needs Review and removed Stale labels Sep 9, 2021

Merge branch 'master' into fix-infer_datetime_format-drop-timezone

8033fe7

mroeschke removed the Needs Review label Sep 11, 2021

Merge branch 'master' into fix-infer_datetime_format-drop-timezone

267b971

mroeschke approved these changes Sep 13, 2021

View reviewed changes

Add to_datetime(,infer_datetime_format=True) bench

a5136ed

Fix style

0538167

jreback approved these changes Sep 14, 2021

View reviewed changes

jreback merged commit 7073dd1 into pandas-dev:master Sep 14, 2021

galipremsagar mentioned this pull request Apr 6, 2022

[BUG] to_timestamps will need to handle %z format rapidsai/cudf#10609

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #42068

BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #42068

i-aki-y commented Jun 17, 2021 •

edited

Loading

mroeschke Jun 17, 2021

i-aki-y Jun 18, 2021 •

edited

Loading

mroeschke Jun 17, 2021

i-aki-y Jun 18, 2021

i-aki-y commented Jun 19, 2021

mroeschke Jun 21, 2021

i-aki-y Jun 22, 2021

mroeschke Jun 22, 2021

i-aki-y commented Jun 23, 2021 •

edited

Loading

mroeschke Jun 23, 2021

i-aki-y Jun 24, 2021

mroeschke Jun 23, 2021

i-aki-y Jun 24, 2021

mroeschke left a comment

i-aki-y commented Jun 24, 2021

mroeschke left a comment

i-aki-y commented Jun 25, 2021

i-aki-y commented Jun 30, 2021

mroeschke left a comment

github-actions bot commented Aug 1, 2021

mroeschke Aug 1, 2021

i-aki-y Aug 5, 2021

mroeschke Aug 6, 2021

i-aki-y Aug 7, 2021

github-actions bot commented Sep 9, 2021

jreback commented Sep 10, 2021

i-aki-y commented Sep 13, 2021

jreback commented Sep 13, 2021

pep8speaks commented Sep 14, 2021 •

edited

Loading

i-aki-y commented Sep 14, 2021

jreback commented Sep 14, 2021

BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #42068

BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #42068

Conversation

i-aki-y commented Jun 17, 2021 • edited Loading

About the issue

About the fix

Choose a reason for hiding this comment

i-aki-y Jun 18, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

i-aki-y commented Jun 19, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

i-aki-y commented Jun 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mroeschke left a comment

Choose a reason for hiding this comment

i-aki-y commented Jun 24, 2021

mroeschke left a comment

Choose a reason for hiding this comment

i-aki-y commented Jun 25, 2021

i-aki-y commented Jun 30, 2021

mroeschke left a comment

Choose a reason for hiding this comment

github-actions bot commented Aug 1, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Sep 9, 2021

jreback commented Sep 10, 2021

i-aki-y commented Sep 13, 2021

jreback commented Sep 13, 2021

pep8speaks commented Sep 14, 2021 • edited Loading

Comment last updated at 2021-09-14 04:50:23 UTC

i-aki-y commented Sep 14, 2021

jreback commented Sep 14, 2021

i-aki-y commented Jun 17, 2021 •

edited

Loading

i-aki-y Jun 18, 2021 •

edited

Loading

i-aki-y commented Jun 23, 2021 •

edited

Loading

pep8speaks commented Sep 14, 2021 •

edited

Loading