Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add origin parameter to Timestamp/to_datetime epoch support. #11745

Closed
jtkiley opened this issue Dec 2, 2015 · 14 comments
Closed

Add origin parameter to Timestamp/to_datetime epoch support. #11745

jtkiley opened this issue Dec 2, 2015 · 14 comments
Labels
API Design Datetime Datetime data dtype
Milestone

Comments

@jtkiley
Copy link
Contributor

jtkiley commented Dec 2, 2015

When using SAS or Stata data, dates are represented as the number of days since 1/1/1960, and other statistical software uses different origin dates. With that in mind, it would be nice to have an origin date that can be specified. See also, #3969.

It's a relatively simple thing, and not hard to work around, of course. However, I end up dealing with date formatting on just about every data set I import, and I imagine that lots of others do, too.

Currently, I do something like this:

import pandas as pd
import datetime

EPOCH1960 = datetime.date(1970, 1, 1) - datetime.date(1960, 1, 1)

data = pd.read_stata('./data.dta')
data['date'] = pd.to_datetime(data['date'],unit='D') - EPOCH1960

In R, the as.Date() function takes an origin parameter for numeric types (see, manual). So, in R, the date part would simply be:

data$date <- as.Date(data$date, origin = '1960-01-01')
@jreback
Copy link
Contributor

jreback commented Dec 2, 2015

sure this could be a parameter to to_datetime (only) as that is the datetime converter
could work alongside the unit parameter (like #11276).

In fact that should be origin rather than unit now that I think about it. xref #11470

@jreback jreback added Datetime Datetime data dtype API Design labels Dec 2, 2015
@jreback jreback added this to the Next Major Release milestone Dec 2, 2015
@jreback jreback modified the milestones: 0.18.0, Next Major Release Dec 13, 2015
@jreback jreback modified the milestones: Next Major Release, 0.18.0 Feb 2, 2016
@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 20, 2016
@jreback jreback modified the milestones: 0.18.2, 0.18.1 May 1, 2016
jreback pushed a commit to jreback/pandas that referenced this issue May 1, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.2 May 1, 2016
@jreback jreback modified the milestones: Next Major Release, 0.18.2 Jul 6, 2016
@bashtage
Copy link
Contributor

bashtage commented Sep 6, 2016

@jtkiley Is read_stata not returning the correct dates by default?

@jtkiley
Copy link
Contributor Author

jtkiley commented Dec 16, 2016

It does not (pandas 0.19.1). Below, see one of my Stata datasets (originally written by R). In this example, I've formatted the date like my original post but without the epoch adjustment. I have some type stuff going on here that I'd fix in a real project, but you can see that the separate year variable is 10 years off from the date.

lpermno       date id_ticker    year  \
12060.0 2014-03-02        GE  2004.0   
86868.0 2016-11-05        GS  2006.0   
24643.0 2018-07-29        AA  2008.0   
NaN 2011-02-01       DAL  2001.0

Here's the same data in Stata, also without adjustment, though I formatted the date using format %td date.

lpermno	date	id_ticker	year
12060	01mar2004	GE	2004
86868	05nov2006	GS	2006
24643	28jul2008	AA	2008
.z	31jan2001	DAL	2001

@bashtage
Copy link
Contributor

@jtkiley Could you post a small DTA that demonstrated this issue?

@bashtage
Copy link
Contributor

@jtkiley Any chance for sharing a DTA with this issue?

@jtkiley
Copy link
Contributor Author

jtkiley commented Jan 20, 2017

Sorry for the delay. Here's one that I reduced down (columns and rows) to what you see above. It was originally written by R and then reduced and saved using Stata. It continues to exhibit this issue.

It's also zipped to make Github happy.
data_epoch.zip

@bashtage
Copy link
Contributor

I can't reproduce it. When I use read_stata, I get:

   lpermno     date id_ticker    year
0  12060.0  16131.0        GE  2004.0
1  86868.0  17110.0        GS  2006.0
2  24643.0  17741.0        AA  2008.0
3      NaN  15006.0       DAL  2001.0

which is identical to what Stata shows. When I convert date to a data column in Stata using format %td date and save it as `date_epoch_td.dta', reading this gets

   lpermno       date id_ticker    year
0  12060.0 2004-03-01        GE  2004.0
1  86868.0 2006-11-05        GS  2006.0
2  24643.0 2008-07-28        AA  2008.0
3      NaN 2001-01-31       DAL  2001.0

which seems to be correct.

@jtkiley
Copy link
Contributor Author

jtkiley commented Jan 20, 2017

@bashtage Right. The problem is when you convert the epoch time using data['date'] = pd.to_datetime(data['date'],unit='D'). If you use my adjustment above, it's right. If not, you end up with the results I showed above. The interpretation problem results from Stata using 1/1/1960 and pandas using 1/1/1970 as the base of epoch time.

@bashtage
Copy link
Contributor

@jtkiley I see. I thought it was a bug in read_stata. FWIW if you export your Stata dates as dates, and not integers/floats then read_stata will correctly use the 1960 epoch date when reading the data in.

@jtkiley
Copy link
Contributor Author

jtkiley commented Jan 20, 2017

@bashtage That makes sense. I was thinking of a to_datetime parameter for setting the origin, as Stata formats aren't the only place that this occurs, and R has such a parameter (presumably for the same reason).

I often see it when moving data around or pulling it from sources that have a Stata export option, and those often don't come with the date formatting intact. I tend to use those export options (often with Stata for co-author accessibility), assemble data in pandas (R in the past), and then export it in Stata format for sharing and analysis.

@jtkiley
Copy link
Contributor Author

jtkiley commented Jan 20, 2017

It looks like #11470 has the origin parameter basically done. It just has some work left finishing it up.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 28, 2017
@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

@bashtage note that #15828 just provides the tools to deal with this, should this be an additional parameter to those reader? (is there actually meta data that tells you the origin?). Could certainly just start this as a doc / post-processing step in any event.

@bashtage
Copy link
Contributor

I don't see a strong reason to allow arbitrary offsets in the Stata interface code. The present version is very loyal to the Stata dta format spec and allowing a semi-random option to be internalized rather than chained seems like the wrong way to do things.

I suppose without explicit support one would have to do something like

pd.to_datetime(dates.astype(np.int64), origin='1-1-1960')

Maybe there would be an easier way to re-originate existing date-times.

@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

@bashtage makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants