Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #42068

Merged

Conversation

i-aki-y
Copy link
Contributor

@i-aki-y i-aki-y commented Jun 17, 2021

In this PR, I try to fix issue #41047.
I would appreciate it if you could review this PR.

About the issue

In the UTC format, "Z" is used as a special timezone "+00:00".
But current pd.to_datetime function can not recognize the last "Z" when the infer_datetime_format is True.
Note that pd.to_datetime can properly interpret "Z" as "+00:00" when the infer_datetime_format is unspecified.

The problem happens in the following codes of datetimes.py:

    if infer_datetime_format and format is None:
        format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)

    if format is not None:
        # There is a special fast-path for iso8601 formatted
        # datetime strings, so in those cases don't use the inferred
        # format because this path makes process slower in this
        # special case
        format_is_iso8601 = format_is_iso(format)
        if format_is_iso8601:
            require_iso8601 = not infer_datetime_format
            format = None

    if format is not None:
        res = _to_datetime_with_format(
            arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
        )
        if res is not None:
            return res

Here, the _guess_datetime_format_for_array returns a format string like '%Y-%m-%d %H:%M:%SZ' depending on the input strings.
The resulting format is estimated by format_is_iso function.
This result is used to judge whether the function specialized for iso format is used or not.
But current format_is_iso can not recognize the last "Z" in a format string, so it returns False if the format has "Z" regardless the format is valid iso8601 format or not.
Because of this misjudgment, the wrong function (_to_datetime_with_format) is used for the iso string, and the timestamp that does not have timezone information is returned.

About the fix

I fixed the format_is_iso function to recognize the last "Z" in the format string to solve this problem.

("%Y.%m.%d %H:%M:%S", True),
("%Y%m%d %H:%M:%S", True),
("%Y-%m-%dT%H:%M:%S", True),
("%Y-%m-%dT%H:%M:%SZ", True),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't appear like a valid strftime directive. To parse Z, I think %z needs to be used e.g.

In [9]: datetime.datetime.strptime("2020-05-01T18:04:03Z", "%Y-%m-%dT%H:%M:%S%z")
Out[9]: datetime.datetime(2020, 5, 1, 18, 4, 3, tzinfo=datetime.timezone.utc)

Copy link
Contributor Author

@i-aki-y i-aki-y Jun 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment.
My fix was done to fit the output of guess_datetime_format.

ex.

In [1]: from pandas._libs.tslibs.parsing import guess_datetime_format
   ...: guess_datetime_format('2020-05-01T18:04:03Z')
Out[1]: '%Y-%m-%dT%H:%M:%SZ'

But as you said, the format string like "%SZ" should be "%S%z"
So now I guess the guess_datetime_format function also has some bugs.

@@ -825,6 +825,10 @@ def format_is_iso(f: str) -> bint:
iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f'.format
excluded_formats = ['%Y%m%d', '%Y%m', '%Y']

if (f is not None) and (f[-2:] in ["SZ", "fZ"]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in my other comment, shouldn't %z be checked in the iso_template?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I will revise my PR.

@jreback jreback added Bug Timezones Timezone data dtype labels Jun 18, 2021
@i-aki-y
Copy link
Contributor Author

i-aki-y commented Jun 19, 2021

@mroeschke I revised my PR based on the review comments.
Now format_is_iso function recognize "%z" and "%Z" format instead of "Z".
To fix the problem, I improved the guess_datetime_format to recognize some additional timezone patterns (see below).

ex.

print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03UTC'))
print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03Z'))
print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03+0000'))

# Before the fix
# %Y-%m-%dT%H:%M:%S%Z
# %Y-%m-%dT%H:%M:%SZ
# %Y-%m-%dT%H:%M:%S+%f

# After the fix
# %Y-%m-%dT%H:%M:%S%Z
# %Y-%m-%dT%H:%M:%S%z
# %Y-%m-%dT%H:%M:%S%z

In the fix of guess_datetime_format, I have tried to make the fix to be simple but there may be some tricky parts.
If you have any suggestions on how to implement it better, I would appreciate your advice.

@@ -825,6 +825,10 @@ def format_is_iso(f: str) -> bint:
iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f'.format
excluded_formats = ['%Y%m%d', '%Y%m', '%Y']

if (f is not None) and (f[-3:] in ["S%z", "S%Z", "f%z", "f%Z"]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead could we just loop over the possible timezone directives in the section below?
e.g.

iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f{tz_dir}'.format
...
   for time_sep in ...:
       for tz_dir in ["", "%z", "%Z"]:
            ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment.
Just to confirm, a loop version will be the following codes.
(I checked this passes the test)

    iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S{micro_or_tz}'.format
    excluded_formats = ['%Y%m%d', '%Y%m', '%Y']

    for date_sep in [' ', '/', '\\', '-', '.', '']:
        for time_sep in [' ', 'T']:
            for micro_or_tz in ['', '%z', '%Z', '.%f', '.%f%z', '.%f%Z']:
                if (iso_template(date_sep=date_sep,
                                 time_sep=time_sep,
                                 micro_or_tz=micro_or_tz,
                                ).startswith(f) and f not in excluded_formats):

My concern was that this increase a nest a little.
But this version might be better for simplicity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the loop I'm talking about. I think this version is more explicit rather than checking for the presence of %z/Z and then removing

@i-aki-y
Copy link
Contributor Author

i-aki-y commented Jun 23, 2021

@mroeschke Now the guess_datetime_format can parse “+hh:mm” type timezone offset.
For consistency, I added some patterns that may not be commonly used to make the function recognize timezone formats that can be parsed correctly by the default parser.
The following examples demonstrate parser results and inferred formats.

import pandas as pd
from dateutil.parser import parse as du_parse

def try_parse(dt_str):
    try:
        print(f"{dt_str:<26}", "->", du_parse(dt_str), "->", pd._libs.tslibs.parsing.guess_datetime_format(dt_str))
    except Exception as e:
        print(type(e), e)
        print("->", pd._libs.tslibs.parsing.guess_datetime_format(dt_str))

try_parse("2011-12-30T00:00:00+9")
try_parse("2011-12-30T00:00:00+09")
try_parse("2011-12-30T00:00:00+0905")
try_parse("2011-12-30T00:00:00+9:5")
try_parse("2011-12-30T00:00:00+09:05")
try_parse("2011-12-30T00:00:00+9:005")
try_parse("2011-12-30T00:00:00+009:00")
try_parse("2011-12-30T00:00:00+090")
try_parse("2011-12-30T00:00:00+09:")

# Outputs:
#
# 2011-12-30T00:00:00+9      -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+09     -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+0905   -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+9:5    -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+09:05  -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+9:005  -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+009:00 -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# <class 'dateutil.parser._parser.ParserError'> Unknown string format: 2011-12-30T00:00:00+090
# -> None
# <class 'dateutil.parser._parser.ParserError'> Unknown string format: 2011-12-30T00:00:00+09:
# -> None

I would appreciate it if you could review this.

# joined tokens, which is carried out at the final step of the function,
# the offset part of the tokens must match the '%z' format like '+0900'
# instead of ‘+09:00’.
if (parsed_datetime.tzinfo is not None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Don't need the parenthesis around this if statement (and most of the other ones you have here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, redundant parentheses are removed.

if (
(len(tokens) > 3)
and tokens[-1].isdigit()
and (tokens[-2] == ':')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be safe, I think this should be any non-digit separator just in case some format using something different than a colon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skipped the check about the separator string. So no assumptions are made about the separator type.
If you think that the not tokens[-2].isdigit() is better, please let me know.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good. Just a few comments

@i-aki-y
Copy link
Contributor Author

i-aki-y commented Jun 24, 2021

@mroeschke Thank you for your advice. I revised the PR. Any comments are welcome.

@mroeschke mroeschke added this to the 1.3 milestone Jun 24, 2021
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work. LGTM

@simonjayhawkins simonjayhawkins mentioned this pull request Jun 25, 2021
@i-aki-y
Copy link
Contributor Author

i-aki-y commented Jun 25, 2021

@mroeschke @jreback Thank you for your reviews.

i appreciate that we have a bug, but this is WAY more complicated than before. can you simplify this at all

@jreback As you say, the current fix is complex rather than what I thought first.

Now I have no idea to simplify the fix of the "guess_datetime_format" without any downgrade of the functionality.

But the fix might be overkill for the original bug.
I think that the minimal fix for the function for this bug is returning None for the timezone offset.

Ex.

    if parsed_datetime.tzinfo is not None:
        if (
            (len(tokens) > 0 and tokens[-1] == "Z")
            or (len(tokens) > 1 and tokens[-2] in ("+", "-"))
        ):
            return None

The only user of the guess_datetime_format is the to_datetime function.
In the to_datetime function, if the guess_datetime_format correctly fails to parse and returns None, the input string is passed to the object_to_datetime64ns function. And then, the object_to_datetime64ns function can parse timezone.

Bugs will happen when the guess_datetime_format incorrectly parses and returns the wrong format.
So the strategy of returning None when the input has a timezone offset also works.

(In the above code, I excluded the condition or (len(tokens) > 3 and tokens[-4] in ("+", "-")) because this pattern correctly fails in the subsequent process.)

Is this acceptable as the alternative solution?

@jreback jreback modified the milestones: 1.3, 1.3.1, 1.4 Jun 25, 2021
@i-aki-y
Copy link
Contributor Author

i-aki-y commented Jun 30, 2021

@mroeschke @jreback I found a more straightforward way to handle timezone offset.
I would appreciate it if you could review this revision.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification. Just one question.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Aug 1, 2021
@@ -932,6 +932,7 @@ Timezones
^^^^^^^^^
- Bug in different ``tzinfo`` objects representing UTC not being treated as equivalent (:issue:`39216`)
- Bug in ``dateutil.tz.gettz("UTC")`` not being recognized as equivalent to other UTC-representing tzinfos (:issue:`39276`)
- Bug in :func:`to_datetime` with ``infer_datetime_format=True`` failing to parse zero UTC offset (``Z``) correctly (:issue:`41047`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. This PR didn't make it into 1.3. Could you move this to 1.4.0 whatsnew?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke OK!
Should I merge the current master and then fix the whatsnew file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that would be great, thank you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed whatsnew entry.

@mroeschke mroeschke removed the Stale label Aug 6, 2021
@github-actions
Copy link
Contributor

github-actions bot commented Sep 9, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jreback
Copy link
Contributor

jreback commented Sep 10, 2021

@i-aki-y can you merge master.

also can you run the to_datetime benchamarks to see if any change here

@i-aki-y
Copy link
Contributor Author

i-aki-y commented Sep 13, 2021

Hi, @jreback @mroeschke I merged the current master branch.

also can you run the to_datetime benchamarks to see if any change here

OK, but what benchmark should I do?
If you say about the benchmark of asv described at the following docs,

https://pandas.pydata.org/docs/dev/development/contributing_codebase.html?highlight=bench#running-the-performance-test-suite

I have run the asv command for the inference.py because the asv_bench/inference.py has some to_datetime benchmarks.

$ asv continuous -f 1.1 master fix-infer_datetime_format-drop-timezone -b inference

And I got the following results summary:

        before           after         ratio
     [02fdbde3]       [267b971f]
     <master>         <fix-infer_datetime_format-drop-timezone>
-        379±20μs         331±10μs     0.87  inference.ToDatetimeCacheSmallCount.time_unique_date_strings(False, 500)
-      2.97±0.2ms      2.58±0.03ms     0.87  inference.ToNumeric.time_from_numeric_str('coerce')
-     1.36±0.07ms      1.06±0.02ms     0.78  inference.ToNumericDowncast.time_downcast('datetime64', 'float')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Although some differences are reported, these are seem to be unrelated to the fix.

I also wrote the following benchmark code specific for the fixes.

import pandas as pd
rng = pd.date_range(start="1/1/2000", periods=20000, freq="H")
strings_tz_space = [
    x.strftime("%Y-%m-%d %H:%M:%S") + "Z" for x in rng
]
fmts = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S%Z",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S.%f%Z",
]

## to_datetime
%timeit pd.to_datetime(strings_tz_space)
%timeit pd.to_datetime(strings_tz_space, infer_datetime_format=True)

## For functions used internally.
%timeit [pd._libs.tslibs.parsing.guess_datetime_format(ts) for ts in strings_tz_space[:500]]
%timeit [pd._libs.tslibs.parsing.format_is_iso(fmt) for fmt in fmts]

The results are here:

%timeit pd.to_datetime(strings_tz_space)
# master: 7.78 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# PR    : 7.51 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ratio : 0.96

%timeit pd.to_datetime(strings_tz_space, infer_datetime_format=True)
# master: 60.1 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# PR    : 8.77 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ratio : 0.14

%timeit [pd._libs.tslibs.parsing.guess_datetime_format(ts) for ts in strings_tz_space[:500]]
# master: 174 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# PR    : 171 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# ratio : 0.98

%timeit [pd._libs.tslibs.parsing.format_is_iso(fmt) for fmt in fmts]
# master: 59.1 µs ± 991 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# PR    : 298 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ratio : 5.04

The PR version of the pd.to_datetime(..., infer_datetime_format=True) looks much faster than the master version.
This seems because the fix of format inference makes select the optimal parse function internally.
The format_is_iso 5x slow down, but I think this impact will be limited because this still be fast and the function called once for each input list, not per list item.

If any other investigation is needed, let me know.

@jreback
Copy link
Contributor

jreback commented Sep 13, 2021

@i-aki-y can you add the additional cases to the existing asv's (e.g. the ones you list here #42068 (comment)); we care about the to_datetime ones rather than the internal functions themselves.

ping on green.

@pep8speaks
Copy link

pep8speaks commented Sep 14, 2021

Hello @i-aki-y! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-14 04:50:23 UTC

@i-aki-y
Copy link
Contributor Author

i-aki-y commented Sep 14, 2021

@jreback I added a benchmark for to_datetime(..., infer_datetime_format=True) to the asv_bench/inference.py.

@jreback jreback merged commit 7073dd1 into pandas-dev:master Sep 14, 2021
@jreback
Copy link
Contributor

jreback commented Sep 14, 2021

thanks @i-aki-y very nice! keep em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.to_datetime(infer_datetime_format=True) drops timezone
5 participants