-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add origin to to_datetime #15828
Conversation
ok, rebase and updated original PR. pls have a look. IIRC there were some more test cases needed, but dont' really remember |
doc/source/timeseries.rst
Outdated
|
||
pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01')) | ||
|
||
The default is set at ``origin='epoch'``, which defaults to ``1970-01-01 00:00:00``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'epoch'
seems non-descriptive - in a sense, aren't all origins epochs? Maybe default should be 'unix'
like @shoyer had suggested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unix time (also known as POSIX time or epoch time) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly the unix epoch date is implied by "epoch time" but it's also a general term, that could refer to epoch timekeeping in general.
https://en.wikipedia.org/wiki/Epoch_(reference_date)#Computing
pandas/tseries/tools.py
Outdated
- If 'epoch', origin is set to 1970-01-01. | ||
- If 'julian', unit must be 'D', and origin is set to beginning of | ||
Julian Calendar. Julian day number 0 is assigned to the day starting | ||
at noon on January 1, 4713 BC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not critical, but could expand to other semi-common origins @bashtage mentions here.
#11470 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, maybe, can always add if people really want this.
Codecov Report
@@ Coverage Diff @@
## master #15828 +/- ##
==========================================
- Coverage 90.98% 90.95% -0.04%
==========================================
Files 143 143
Lines 49449 49464 +15
==========================================
- Hits 44993 44991 -2
- Misses 4456 4473 +17
Continue to review full report at Codecov.
|
ok changed |
pandas/tseries/tools.py
Outdated
infer_datetime_format : boolean, default False | ||
If True and no `format` is given, attempt to infer the format of the | ||
datetime strings, and if it can be inferred, switch to a faster | ||
method of parsing them. In some cases this can increase the parsing | ||
speed by ~5-10x. | ||
origin : scalar convertible to Timestamp / string ('julian', 'unix'), | ||
default 'unix'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that this does not work ... type explanation should be on one line (for good html docs) (ref numpy/numpydoc#87)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so pretty much make this shorter then?
pandas/tseries/tools.py
Outdated
@@ -297,8 +312,13 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, | |||
>>> %timeit pd.to_datetime(s,infer_datetime_format=False) | |||
1 loop, best of 3: 471 ms per loop | |||
|
|||
""" | |||
Using non-epoch origins to parse date | |||
>>> pd.to_datetime([1,2,3], unit='D', origin=pd.Timestamp('1960-01-01')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line between the two lines above
pandas/tseries/tools.py
Outdated
"to a Timestamp".format(origin)) | ||
|
||
if offset is not None: | ||
result = result + offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I commented in the original PR (#11470 (comment)), I think it can solve some corner cases to do the origin handling before the actual parsing (as you do for the 'julian' case).
A bit an artificial example, but to show one of the problems:
In [11]: pd.to_datetime(200*365, unit='D')
Out[11]: Timestamp('2169-11-13 00:00:00')
In [12]: pd.to_datetime(200*365, unit='D', origin='1870-01-01')
Out[12]: Timestamp('2069-11-13 00:00:00')
In [13]: pd.to_datetime(300*365, unit='D', origin='1870-01-01')
...
OutOfBoundsDatetime: cannot convert input with unit 'D'
So the last one is actually not an OutOfBounds datetime, but just because we first parse the number as epoch, it raises this error. If we first subtract the correct value from the numeric argument, and then parse as epoch, the above will work fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed! this was tricky.
pandas/tseries/tools.py
Outdated
offset = tslib.Timestamp(origin) - tslib.Timestamp(0) | ||
except tslib.OutOfBoundsDatetime: | ||
raise ValueError( | ||
"origin {} is Out of Bounds".format(origin)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a todo item to solve this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what part, OutOfBounds dates generally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I mean for origin specifically (by circumventing the Timestamp
creation), to have something like pd.to_datetime(.., origin='0001-01-01')
working if the end timestamp is not out of bounds.
Eg by using dateutil.parser.parse(origin)
, but then we still need to convert that to the correct offset
Some cases where something goes wrong:
|
doc/source/whatsnew/v0.20.0.txt
Outdated
to_datetime has gained an origin parameter | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
``pd.to_datetime`` has gained a new parameter, ``origin``, to define an offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use 'reference date' instead of 'offset' here as well (as you did in the docstring)
Here are the results with new push.
note [1], [2] (were correct before) |
pandas/tseries/tools.py
Outdated
origin : scalar convertible to Timestamp / string ('julian', 'unix'), | ||
default 'unix'. | ||
Define reference date. The numeric values would be parsed as number | ||
of units (defined by `unit`) since this reference date. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave this sentence in, it is a good explanation of what an origin is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New changes look good!
if np.any(arg > j_max) or np.any(arg < j_min): | ||
raise tslib.OutOfBoundsDatetime( | ||
"{original} is Out of Bounds for " | ||
"origin='julian'".format(original=original)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can have the same problem with a custom defined origin (not unix of julian)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but that is handled below (this is only for the julian section). I do raise the same error (w/o the julian reference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif origin not in ['unix', 'julian']:
# arg must be a numeric
original = arg
if not ((is_scalar(arg) and (is_integer(arg) or is_float(arg))) or
is_numeric_dtype(np.asarray(arg))):
raise ValueError(
"'{arg}' is not compatible with origin='{origin}'; "
"it must be numeric with a unit specified ".format(
arg=arg,
origin=origin))
# we are going to offset back to unix / epoch time
try:
offset = tslib.Timestamp(origin) - tslib.Timestamp(0)
except tslib.OutOfBoundsDatetime:
raise tslib.OutOfBoundsDatetime(
"origin {} is Out of Bounds".format(origin))
except ValueError:
raise ValueError("origin {} cannot be converted "
"to a Timestamp".format(origin))
pandas/tseries/tools.py
Outdated
# this should be lossless in terms of precision | ||
offset = offset // tslib.Timedelta(1, unit=unit) | ||
|
||
arg = np.asarray(arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be simpler to check if it is a list and then convert to array? (or check that is is not a scalar/series/index)
Then the arg = arg + offset
does just what you want, and the conversion back as below is not needed
provide out-of-bounds for julian dates
@jorisvandenbossche make those changes and merged. thanks for the review. |
And thanks for picking up the PR! |
closes #11276
closes #11745
superseded #11470