Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add origin to to_datetime #15828

Closed
wants to merge 6 commits into from
Closed

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Mar 28, 2017

closes #11276
closes #11745

superseded #11470

@jreback
Copy link
Contributor Author

jreback commented Mar 28, 2017

@jorisvandenbossche @chris-b1

ok, rebase and updated original PR. pls have a look.

IIRC there were some more test cases needed, but dont' really remember


pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))

The default is set at ``origin='epoch'``, which defaults to ``1970-01-01 00:00:00``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'epoch' seems non-descriptive - in a sense, aren't all origins epochs? Maybe default should be 'unix' like @shoyer had suggested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unix time (also known as POSIX time or epoch time) ?

https://en.wikipedia.org/wiki/Unix_time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly the unix epoch date is implied by "epoch time" but it's also a general term, that could refer to epoch timekeeping in general.

https://en.wikipedia.org/wiki/Epoch_(reference_date)#Computing

- If 'epoch', origin is set to 1970-01-01.
- If 'julian', unit must be 'D', and origin is set to beginning of
Julian Calendar. Julian day number 0 is assigned to the day starting
at noon on January 1, 4713 BC.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not critical, but could expand to other semi-common origins @bashtage mentions here.
#11470 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, maybe, can always add if people really want this.

@codecov
Copy link

codecov bot commented Mar 28, 2017

Codecov Report

Merging #15828 into master will decrease coverage by 0.03%.
The diff coverage is 91.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15828      +/-   ##
==========================================
- Coverage   90.98%   90.95%   -0.04%     
==========================================
  Files         143      143              
  Lines       49449    49464      +15     
==========================================
- Hits        44993    44991       -2     
- Misses       4456     4473      +17
Flag Coverage Δ
#multiple ?
#single ?
Impacted Files Coverage Δ
pandas/tseries/tdi.py 90.37% <100%> (+0.03%) ⬆️
pandas/tseries/tools.py 85.2% <90.24%> (+0.26%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/packers.py 87.57% <0%> (-0.94%) ⬇️
pandas/core/common.py 90.96% <0%> (-0.34%) ⬇️
pandas/io/html.py 84.48% <0%> (-0.33%) ⬇️
pandas/core/series.py 94.86% <0%> (-0.11%) ⬇️
pandas/core/frame.py 97.56% <0%> (-0.1%) ⬇️
pandas/compat/numpy/function.py 94.19% <0%> (-0.04%) ⬇️
pandas/core/indexing.py 94.08% <0%> (-0.03%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67cc021...ebb4acd. Read the comment docs.

@jreback
Copy link
Contributor Author

jreback commented Mar 28, 2017

ok changed epoch -> unix and updated docs.

infer_datetime_format : boolean, default False
If True and no `format` is given, attempt to infer the format of the
datetime strings, and if it can be inferred, switch to a faster
method of parsing them. In some cases this can increase the parsing
speed by ~5-10x.
origin : scalar convertible to Timestamp / string ('julian', 'unix'),
default 'unix'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that this does not work ... type explanation should be on one line (for good html docs) (ref numpy/numpydoc#87)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so pretty much make this shorter then?

@@ -297,8 +312,13 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,
>>> %timeit pd.to_datetime(s,infer_datetime_format=False)
1 loop, best of 3: 471 ms per loop

"""
Using non-epoch origins to parse date
>>> pd.to_datetime([1,2,3], unit='D', origin=pd.Timestamp('1960-01-01'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line between the two lines above

"to a Timestamp".format(origin))

if offset is not None:
result = result + offset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented in the original PR (#11470 (comment)), I think it can solve some corner cases to do the origin handling before the actual parsing (as you do for the 'julian' case).

A bit an artificial example, but to show one of the problems:

In [11]: pd.to_datetime(200*365, unit='D')
Out[11]: Timestamp('2169-11-13 00:00:00')

In [12]: pd.to_datetime(200*365, unit='D', origin='1870-01-01')
Out[12]: Timestamp('2069-11-13 00:00:00')

In [13]: pd.to_datetime(300*365, unit='D', origin='1870-01-01')
...
OutOfBoundsDatetime: cannot convert input with unit 'D'

So the last one is actually not an OutOfBounds datetime, but just because we first parse the number as epoch, it raises this error. If we first subtract the correct value from the numeric argument, and then parse as epoch, the above will work fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed! this was tricky.

offset = tslib.Timestamp(origin) - tslib.Timestamp(0)
except tslib.OutOfBoundsDatetime:
raise ValueError(
"origin {} is Out of Bounds".format(origin))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a todo item to solve this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what part, OutOfBounds dates generally?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I mean for origin specifically (by circumventing the Timestamp creation), to have something like pd.to_datetime(.., origin='0001-01-01') working if the end timestamp is not out of bounds.

Eg by using dateutil.parser.parse(origin), but then we still need to convert that to the correct offset

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 29, 2017

Some cases where something goes wrong:

  • This should raise an error (if the origin is first processed and the argument adapted for it, then this will raise an error):

    In [2]: pd.to_datetime("2005-01-01", origin="1960-01-01")
    Out[2]: Timestamp('1995-01-01 00:00:00')
    
  • Not sure how this can return 12:00 as the hour? (EDIT -> this is probably just the definition? "starting at noon")

    In [4]: pd.to_datetime(2456658, origin='julian', unit='D')
    Out[4]: Timestamp('2013-12-31 12:00:00')
    
  • The error message in case of out of bounds here:

    In [3]: pd.to_datetime(1, origin="julian", unit='D')
    ....
    OutOfBoundsDatetime: cannot convert input -2440586.5 with the unit 'D'
    

    is a bit unfortunate (the "input -2440586.5"). But that is of course the consequence of processing the origin first and adapting the arg based on that, before passing to the actual parsing methods ... (so that's a clear drawback of this approach I suggested also for the other origins). Not sure whether this could easily be solved ..

to_datetime has gained an origin parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``pd.to_datetime`` has gained a new parameter, ``origin``, to define an offset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use 'reference date' instead of 'offset' here as well (as you did in the docstring)

@jreback
Copy link
Contributor Author

jreback commented Mar 29, 2017

Here are the results with new push.

In [1]: pd.to_datetime(200*365, unit='D')
Out[1]: Timestamp('2169-11-13 00:00:00')

In [2]: pd.to_datetime(200*365, unit='D', origin='1870-01-01')
Out[2]: Timestamp('2069-11-13 00:00:00')

In [3]: pd.to_datetime(300*365, unit='D', origin='1870-01-01')
Out[3]: Timestamp('2169-10-20 00:00:00')

In [4]: pd.to_datetime("2005-01-01", origin="1960-01-01")
ValueError: '2005-01-01' is not compatible with origin='1960-01-01'; it must be numeric with a unit specified 

In [5]: pd.to_datetime(2456658, origin='julian', unit='D')
Out[5]: Timestamp('2013-12-31 12:00:00')

In [6]: pd.to_datetime(1, origin="julian", unit='D')
OutOfBoundsDatetime: 1 is Out of Bounds for origin='julian'

note [1], [2] (were correct before)
[5] seems fine as well.

origin : scalar convertible to Timestamp / string ('julian', 'unix'),
default 'unix'.
Define reference date. The numeric values would be parsed as number
of units (defined by `unit`) since this reference date.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave this sentence in, it is a good explanation of what an origin is.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New changes look good!

if np.any(arg > j_max) or np.any(arg < j_min):
raise tslib.OutOfBoundsDatetime(
"{original} is Out of Bounds for "
"origin='julian'".format(original=original))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can have the same problem with a custom defined origin (not unix of julian)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but that is handled below (this is only for the julian section). I do raise the same error (w/o the julian reference)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    elif origin not in ['unix', 'julian']:

        # arg must be a numeric
        original = arg
        if not ((is_scalar(arg) and (is_integer(arg) or is_float(arg))) or
                is_numeric_dtype(np.asarray(arg))):
            raise ValueError(
                "'{arg}' is not compatible with origin='{origin}'; "
                "it must be numeric with a unit specified ".format(
                    arg=arg,
                    origin=origin))

        # we are going to offset back to unix / epoch time
        try:
            offset = tslib.Timestamp(origin) - tslib.Timestamp(0)
        except tslib.OutOfBoundsDatetime:
            raise tslib.OutOfBoundsDatetime(
                "origin {} is Out of Bounds".format(origin))
        except ValueError:
            raise ValueError("origin {} cannot be converted "
                             "to a Timestamp".format(origin))

# this should be lossless in terms of precision
offset = offset // tslib.Timedelta(1, unit=unit)

arg = np.asarray(arg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be simpler to check if it is a list and then convert to array? (or check that is is not a scalar/series/index)

Then the arg = arg + offset does just what you want, and the conversion back as below is not needed

@jreback
Copy link
Contributor Author

jreback commented Apr 2, 2017

@jorisvandenbossche make those changes and merged. thanks for the review.

@jorisvandenbossche
Copy link
Member

And thanks for picking up the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add origin parameter to Timestamp/to_datetime epoch support. implementation of from_julian_date()
4 participants