Implement DatetimeArray._from_sequence #24074
Conversation
Hello @jbrockmendel! Thanks for submitting the PR.
# go through _simple_new instead
warnings.simplefilter("ignore")
result = cls.__new__(cls, verify_integrity=False, **d)
if "data" in d and not isinstance(d["data"], DatetimeIndex):
Could you explain the reasoning behind this change? I don't see why just DatetimeIndex would need to have the integrity verified.
Without this we get a couple of failing pickle-based tests: tests.frame.test_block_internals.test_pickle and tests.series.test_timeseries.test_pickle.
Gotcha, I've been fighting those errors too. I suspect they were working before because DatetimeIndex accepts data=None for range-based constructors, which we don't want for the arrays.
this is because you need to define __reduce__ on ExtensionArray. This should never be hit by a non-DTI.
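For context, a minimal sketch of what that could look like. The class, its attributes, and the _simple_new fast path here are hypothetical stand-ins, not the actual pandas implementation:

```python
import pickle
import numpy as np

class DatetimeArraySketch:
    """Toy stand-in for an ExtensionArray backed by a datetime64 ndarray."""

    @classmethod
    def _simple_new(cls, values, freq=None):
        # Trusted fast path: no validation, because `values` was already
        # checked when the array was first constructed.
        obj = object.__new__(cls)
        obj._data = values
        obj.freq = freq
        return obj

    def __reduce__(self):
        # Rebuild through the fast path on unpickle, so nothing like a
        # verify_integrity kwarg ever needs to be threaded through __new__.
        return (type(self)._simple_new, (self._data, self.freq))

arr = DatetimeArraySketch._simple_new(
    np.arange(3, dtype=np.int64).view("M8[ns]"), freq="D"
)
roundtripped = pickle.loads(pickle.dumps(arr))  # no DatetimeIndex involved
```

With a __reduce__ along these lines, unpickling never routes through DatetimeIndex.__new__ at all.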
It isn't. The issue is that this function calls DatetimeIndex.__new__ with verify_integrity=False (since it is unpickling a previously-constructed DTI, integrity has presumably already been verified, so we can skip that somewhat-costly step), and the pickle-tested cases raise ValueError when we do try to verify their integrity.
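As an illustration of the kind of failure being skipped (my example, not from the PR; the exact message varies across pandas versions):

```python
import pandas as pd

# Daily-spaced values with a claimed frequency of '2D': the constructor's
# integrity check notices the mismatch and raises ValueError.
values = pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]).values
try:
    pd.DatetimeIndex(values, freq="2D")
except ValueError as err:
    print(err)  # inferred frequency does not conform to passed frequency
```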
This is fixed by #24096
let's fix that one first then. this needs to be changed here.
but doesn't the fact that we're having to copy data over in #24096 seem disconnected from pickling?
Pickling turns out to be only tangentially related to the "real" problem. In the status quo, altering a datetime64tz column alters the DatetimeIndex that backs it, but doesn't set its freq to None. When that DataFrame is pickled and then unpickled, it tries to reconstruct that DatetimeIndex, but is passing arguments that should raise ValueError. ATM that gets side-stepped by passing verify_integrity=False.

So the goal of #24096 is to not corrupt the DatetimeIndex in the first place, making verify_integrity=False unnecessary.

That's still pretty roundabout. Any clearer?
Yeah, that does help.
Codecov Report
@@ Coverage Diff @@
## master #24074 +/- ##
==========================================
- Coverage 42.38% 42.38% -0.01%
==========================================
Files 161 161
Lines 51701 51691 -10
==========================================
- Hits 21914 21907 -7
+ Misses 29787 29784 -3
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #24074 +/- ##
==========================================
+ Coverage 92.2% 92.2% +<.01%
==========================================
Files 162 162
Lines 51729 51717 -12
==========================================
- Hits 47697 47686 -11
+ Misses 4032 4031 -1
Continue to review full report at Codecov.
Sounds good.
pandas/core/arrays/datetimes.py
Outdated
# assume this data are epoch timestamps
if data.dtype != _INT64_DTYPE:
    data = data.astype(np.int64, copy=False)
subarr = data.view(_NS_DTYPE)
might be able to pull subarr = data.view(_NS_DTYPE) out of the if/else and just do it always (or maybe just do it in _simple_new), but this is for later
That's what I tried originally, but it breaks the recently-implemented tests.arrays.test_datetimelike.test_from_array_keeps_base (#23956)
Hrm, sorry. FWIW, that was a hackish thing that I introduced to avoid segfaults. If the reduction code was doing the right thing, we may be able to remove that test / requirement. But I'm not sure what the right thing is.
For the purposes of this PR it sounds like the options are to either remove that test and un-indent this line, or keep this line how it is. It sounds like the first option would cause problems in your DTA branch. I don't have a strong preference here.
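For reference on why the placement of the .view call matters to test_from_array_keeps_base: a NumPy view records the array it came from in its .base attribute, while inserting a copy in between changes what .base points to.

```python
import numpy as np

data = np.arange(3, dtype=np.int64)

# View taken directly off the original: shares memory, .base is the original.
subarr = data.view("M8[ns]")
assert subarr.base is data

# Copy first, then view: .base is now the intermediate copy instead.
copied_view = data.astype(np.int64, copy=True).view("M8[ns]")
assert copied_view.base is not data
```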
if freq is None and hasattr(values, "freq"):
    # i.e. DatetimeArray, DatetimeIndex
    freq = values.freq

@classmethod
is there a reason you don't want to add verify_integrity here (as maybe _verify_integrity=True)?
We've deprecated the kwarg in the DatetimeIndex constructor to get rid of it. In cases where verify_integrity is not needed, a different constructor (e.g. _simple_new) should be used.
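For illustration, the generic shape of such a kwarg deprecation (a pattern sketch, not the actual pandas code):

```python
import warnings

class SketchIndex:
    def __new__(cls, data=None, verify_integrity=None):
        if verify_integrity is not None:
            # Any explicit use of the kwarg warns; the default behavior
            # (verification on) is preserved until removal.
            warnings.warn(
                "The 'verify_integrity' argument is deprecated and will "
                "be removed in a future version.",
                FutureWarning,
                stacklevel=2,
            )
        else:
            verify_integrity = True
        obj = super().__new__(cls)
        obj.data = data  # real code would verify here if requested
        return obj
```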
On the function signature, it would be nice if it matched the interface, but that's not strictly necessary. Apparently, https://en.wikipedia.org/wiki/Liskov_substitution_principle is the "rule" here, and I don't think adding optional arguments like [...].

Fundamentally, I think this is because we're overloading [...].

With PeriodArray, we got around this with a top-level [...].

What do you think about moving this PR's current [...]? I see that [...]
definitely +1 on having these match
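A sketch of the Liskov point: an override that only adds optional parameters stays substitutable, since callers coded against the base signature still work. The base signature mirrors the _from_sequence interface under discussion; the extra kwargs are illustrative.

```python
class ExtensionArraySketch:
    """Stand-in for the base interface."""

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        raise NotImplementedError

class DatetimeArraySubclassSketch(ExtensionArraySketch):
    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False,
                       tz=None, freq=None):
        # Only *optional* parameters are added, so any caller written
        # against the base signature still calls this override correctly.
        return cls()
```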
I'm on board with separating most of the method out into a [...]. I still think having [...] (plus there's also [...]).
That's all totally fair.
Of those two, I'm not sure which is preferred. Some scattered observations: [...]
I'm about to push a new commit, the relevant changes being: [...]
That test failed here, not in #24096. It was fixed by avoiding calling [...].
I don't see how that will work, but pickling is an area where I frequently find new and exciting ways to screw up. My preference would be to merge this with the small [...].
ok, rebase just in case here
Rebased. Haven't reverted the _new_DatetimeIndex change because there is one other case that still fails: there is an overflow in _generate_range that we're not handling correctly. I'll address that in a separate PR before long.
Raises
------
TypeError : if both timezones are present but do not match
TypeError : if PeriodDtype data is passed
is this explicitly handled?
Via maybe_convert_dtype
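A paraphrase of the check being referred to (my reconstruction, not the actual maybe_convert_dtype source):

```python
from pandas.core.dtypes.common import is_period_dtype

def _reject_period_dtype(data):
    # Period-dtype values cannot be reinterpreted as datetime64; the
    # caller should convert explicitly instead.
    if is_period_dtype(data):
        raise TypeError(
            "Passing PeriodDtype data is invalid. "
            "Use `data.to_timestamp()` instead"
        )
    return data
```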
thanks @jbrockmendel merging to unblock things. but as they say in boxing: keep it clean! hahah
commit 28c61d770f6dfca6857fd0fa6979d4119a31129e
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:18:19 2018 -0600

    uncomment

commit bae2e322523efc73a1344464f51611e2dc555ccb
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:17:09 2018 -0600

    maybe fixes

commit 6cb4db05c9d6ceba3794096f0172cae5ed5f6019
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:57:37 2018 -0600

    we back

commit d97ab57fb32cb23371169d9ed659ccfac34cfe45
Merge: a117de4 b78aa8d
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:51:51 2018 -0600

    Merge remote-tracking branch 'upstream/master' into disown-tz-only-rebased2

commit b78aa8d
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:18:44 2018 -0500

    REF/TST: Add pytest idiom to reshape/test_tile (pandas-dev#24107)

commit 2993b8e
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:17:55 2018 -0500

    REF/TST: Add more pytest idiom to scalar/test_nat (pandas-dev#24120)

commit b841374
Author: evangelineliu <hsiyinliu@gmail.com>
Date:   Wed Dec 5 18:21:46 2018 -0500

    BUG: Fix concat series loss of timezone (pandas-dev#24027)

commit 4ae63aa
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:44:50 2018 -0800

    Implement DatetimeArray._from_sequence (pandas-dev#24074)

commit 2643721
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:43:45 2018 -0800

    CLN: Follow-up to pandas-dev#24100 (pandas-dev#24116)

commit 8ea7744
Author: chris-b1 <cbartak@gmail.com>
Date:   Wed Dec 5 14:21:23 2018 -0600

    PERF: ascii c string functions (pandas-dev#23981)

commit cb862e4
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 12:19:46 2018 -0800

    BUG: fix mutation of DTI backing Series/DataFrame (pandas-dev#24096)

commit aead29b
Author: topper-123 <contribute@tensortable.com>
Date:   Wed Dec 5 19:06:00 2018 +0000

    API: rename MultiIndex.labels to MultiIndex.codes (pandas-dev#23752)
Removes dependence of DatetimeArray.__new__ on DatetimeIndex. De-duplicated DatetimeIndex.__new__ / DatetimeArray.__new__.

The contents of DatetimeArray._from_sequence are basically just moved from DatetimeIndex.__new__. This is feasible because #23675 disentangled to_datetime from DatetimeIndex.__new__.

cc @TomAugspurger this is the last thing on my todo list for DTA/TDA. LMK if I can be helpful with the composition transition.
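A hedged usage sketch of the constructor this PR introduces. The import path is the internal location in the pandas source tree around this era (not public API), and the tz keyword reflects the signature discussion above:

```python
from pandas.core.arrays import DatetimeArray  # internal path; may move

arr = DatetimeArray._from_sequence(
    ["2018-01-01", "2018-01-02", "2018-01-03"], tz="UTC"
)
print(arr.dtype)  # datetime64[ns, UTC]
```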