-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #42068
BUG: pd.to_datetime(infer_datetime_format=True) drops timezone #42068
Conversation
pandas/tests/tslibs/test_parsing.py
Outdated
("%Y.%m.%d %H:%M:%S", True), | ||
("%Y%m%d %H:%M:%S", True), | ||
("%Y-%m-%dT%H:%M:%S", True), | ||
("%Y-%m-%dT%H:%M:%SZ", True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't appear like a valid strftime directive. To parse Z
, I think %z
needs to be used e.g.
In [9]: datetime.datetime.strptime("2020-05-01T18:04:03Z", "%Y-%m-%dT%H:%M:%S%z")
Out[9]: datetime.datetime(2020, 5, 1, 18, 4, 3, tzinfo=datetime.timezone.utc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your comment.
My fix was done to fit the output of guess_datetime_format
.
ex.
In [1]: from pandas._libs.tslibs.parsing import guess_datetime_format
...: guess_datetime_format('2020-05-01T18:04:03Z')
Out[1]: '%Y-%m-%dT%H:%M:%SZ'
But as you said, the format string like "%SZ" should be "%S%z"
So now I guess the guess_datetime_format
function also has some bugs.
pandas/_libs/tslibs/parsing.pyx
Outdated
@@ -825,6 +825,10 @@ def format_is_iso(f: str) -> bint: | |||
iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f'.format | |||
excluded_formats = ['%Y%m%d', '%Y%m', '%Y'] | |||
|
|||
if (f is not None) and (f[-2:] in ["SZ", "fZ"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in my other comment, shouldn't %z
be checked in the iso_template
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I will revise my PR.
@mroeschke I revised my PR based on the review comments. ex. print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03UTC'))
print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03Z'))
print(pd._libs.tslibs.parsing.guess_datetime_format('2020-05-01T18:04:03+0000'))
# Before the fix
# %Y-%m-%dT%H:%M:%S%Z
# %Y-%m-%dT%H:%M:%SZ
# %Y-%m-%dT%H:%M:%S+%f
# After the fix
# %Y-%m-%dT%H:%M:%S%Z
# %Y-%m-%dT%H:%M:%S%z
# %Y-%m-%dT%H:%M:%S%z In the fix of |
pandas/_libs/tslibs/parsing.pyx
Outdated
@@ -825,6 +825,10 @@ def format_is_iso(f: str) -> bint: | |||
iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f'.format | |||
excluded_formats = ['%Y%m%d', '%Y%m', '%Y'] | |||
|
|||
if (f is not None) and (f[-3:] in ["S%z", "S%Z", "f%z", "f%Z"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead could we just loop over the possible timezone directives in the section below?
e.g.
iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f{tz_dir}'.format
...
for time_sep in ...:
for tz_dir in ["", "%z", "%Z"]:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your comment.
Just to confirm, a loop version will be the following codes.
(I checked this passes the test)
iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S{micro_or_tz}'.format
excluded_formats = ['%Y%m%d', '%Y%m', '%Y']
for date_sep in [' ', '/', '\\', '-', '.', '']:
for time_sep in [' ', 'T']:
for micro_or_tz in ['', '%z', '%Z', '.%f', '.%f%z', '.%f%Z']:
if (iso_template(date_sep=date_sep,
time_sep=time_sep,
micro_or_tz=micro_or_tz,
).startswith(f) and f not in excluded_formats):
My concern was that this increase a nest a little.
But this version might be better for simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's the loop I'm talking about. I think this version is more explicit rather than checking for the presence of %z/Z
and then removing
@mroeschke Now the import pandas as pd
from dateutil.parser import parse as du_parse
def try_parse(dt_str):
try:
print(f"{dt_str:<26}", "->", du_parse(dt_str), "->", pd._libs.tslibs.parsing.guess_datetime_format(dt_str))
except Exception as e:
print(type(e), e)
print("->", pd._libs.tslibs.parsing.guess_datetime_format(dt_str))
try_parse("2011-12-30T00:00:00+9")
try_parse("2011-12-30T00:00:00+09")
try_parse("2011-12-30T00:00:00+0905")
try_parse("2011-12-30T00:00:00+9:5")
try_parse("2011-12-30T00:00:00+09:05")
try_parse("2011-12-30T00:00:00+9:005")
try_parse("2011-12-30T00:00:00+009:00")
try_parse("2011-12-30T00:00:00+090")
try_parse("2011-12-30T00:00:00+09:")
# Outputs:
#
# 2011-12-30T00:00:00+9 -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+09 -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+0905 -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+9:5 -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+09:05 -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+9:005 -> 2011-12-30 00:00:00+09:05 -> %Y-%m-%dT%H:%M:%S%z
# 2011-12-30T00:00:00+009:00 -> 2011-12-30 00:00:00+09:00 -> %Y-%m-%dT%H:%M:%S%z
# <class 'dateutil.parser._parser.ParserError'> Unknown string format: 2011-12-30T00:00:00+090
# -> None
# <class 'dateutil.parser._parser.ParserError'> Unknown string format: 2011-12-30T00:00:00+09:
# -> None I would appreciate it if you could review this. |
pandas/_libs/tslibs/parsing.pyx
Outdated
# joined tokens, which is carried out at the final step of the function, | ||
# the offset part of the tokens must match the '%z' format like '+0900' | ||
# instead of ‘+09:00’. | ||
if (parsed_datetime.tzinfo is not None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Don't need the parenthesis around this if statement (and most of the other ones you have here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, redundant parentheses are removed.
pandas/_libs/tslibs/parsing.pyx
Outdated
if ( | ||
(len(tokens) > 3) | ||
and tokens[-1].isdigit() | ||
and (tokens[-2] == ':') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be safe, I think this should be any non-digit separator just in case some format using something different than a colon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skipped the check about the separator string. So no assumptions are made about the separator type.
If you think that the not tokens[-2].isdigit()
is better, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good. Just a few comments
@mroeschke Thank you for your advice. I revised the PR. Any comments are welcome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work. LGTM
@mroeschke @jreback Thank you for your reviews.
@jreback As you say, the current fix is complex rather than what I thought first. Now I have no idea to simplify the fix of the "guess_datetime_format" without any downgrade of the functionality. But the fix might be overkill for the original bug. Ex. if parsed_datetime.tzinfo is not None:
if (
(len(tokens) > 0 and tokens[-1] == "Z")
or (len(tokens) > 1 and tokens[-2] in ("+", "-"))
):
return None The only user of the Bugs will happen when the (In the above code, I excluded the condition Is this acceptable as the alternative solution? |
@mroeschke @jreback I found a more straightforward way to handle timezone offset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice simplification. Just one question.
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -932,6 +932,7 @@ Timezones | |||
^^^^^^^^^ | |||
- Bug in different ``tzinfo`` objects representing UTC not being treated as equivalent (:issue:`39216`) | |||
- Bug in ``dateutil.tz.gettz("UTC")`` not being recognized as equivalent to other UTC-representing tzinfos (:issue:`39276`) | |||
- Bug in :func:`to_datetime` with ``infer_datetime_format=True`` failing to parse zero UTC offset (``Z``) correctly (:issue:`41047`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. This PR didn't make it into 1.3. Could you move this to 1.4.0 whatsnew?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke OK!
Should I merge the current master and then fix the whatsnew file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that would be great, thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed whatsnew entry.
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@i-aki-y can you merge master. also can you run the to_datetime benchamarks to see if any change here |
Hi, @jreback @mroeschke I merged the current master branch.
OK, but what benchmark should I do? I have run the
And I got the following results summary:
Although some differences are reported, these are seem to be unrelated to the fix. I also wrote the following benchmark code specific for the fixes. import pandas as pd
rng = pd.date_range(start="1/1/2000", periods=20000, freq="H")
strings_tz_space = [
x.strftime("%Y-%m-%d %H:%M:%S") + "Z" for x in rng
]
fmts = [
"%Y-%m-%dT%H:%M:%S",
"%Y-%m-%dT%H:%M:%S%z",
"%Y-%m-%dT%H:%M:%S%Z",
"%Y-%m-%dT%H:%M:%S.%f",
"%Y-%m-%dT%H:%M:%S.%f%z",
"%Y-%m-%dT%H:%M:%S.%f%Z",
]
## to_datetime
%timeit pd.to_datetime(strings_tz_space)
%timeit pd.to_datetime(strings_tz_space, infer_datetime_format=True)
## For functions used internally.
%timeit [pd._libs.tslibs.parsing.guess_datetime_format(ts) for ts in strings_tz_space[:500]]
%timeit [pd._libs.tslibs.parsing.format_is_iso(fmt) for fmt in fmts] The results are here:
The PR version of the If any other investigation is needed, let me know. |
@i-aki-y can you add the additional cases to the existing asv's (e.g. the ones you list here #42068 (comment)); we care about the ping on green. |
@jreback I added a benchmark for |
thanks @i-aki-y very nice! keep em coming! |
In this PR, I try to fix issue #41047.
I would appreciate it if you could review this PR.
About the issue
In the UTC format, "Z" is used as a special timezone "+00:00".
But current
pd.to_datetime
function can not recognize the last "Z" when theinfer_datetime_format
isTrue
.Note that
pd.to_datetime
can properly interpret "Z" as "+00:00" when theinfer_datetime_format
is unspecified.The problem happens in the following codes of datetimes.py:
Here, the
_guess_datetime_format_for_array
returns a format string like '%Y-%m-%d %H:%M:%SZ' depending on the input strings.The resulting format is estimated by
format_is_iso
function.This result is used to judge whether the function specialized for iso format is used or not.
But current
format_is_iso
can not recognize the last "Z" in a format string, so it returnsFalse
if the format has "Z" regardless the format is valid iso8601 format or not.Because of this misjudgment, the wrong function (
_to_datetime_with_format
) is used for the iso string, and the timestamp that does not have timezone information is returned.About the fix
I fixed the format_is_iso function to recognize the last "Z" in the format string to solve this problem.