Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: preserve object-dtype index when accessing DataFrame column / PERF: improve perf of Series fastpath constructor #42950

Conversation

jorisvandenbossche
Copy link
Member

This came up while discussing the CoW proof of concept (and need for item_cache / speed of creating a Series of a column). Those small changes gives a decent reduction of some overhead (~30% reduction):

arr = np.array([1, 2, 3])
idx = pd.Index(["a", "b", "c"])

pd.Series(arr, index=idx, name="name", fastpath=True)
In [5]: %timeit pd.Series(arr, index=idx, name="name", fastpath=True)
10.9 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
7.48 µs ± 187 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR

@jorisvandenbossche jorisvandenbossche added Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance labels Aug 9, 2021
# labels may exceeds datetime bounds,
# or not be a DatetimeIndex
pass
if labels._is_all_dates:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently this is being relied upon in the pytables code though ..

And looking into it further, the change I made here of course changes behaviour, as it no longer try to infer object dtype as datetime in the fastpath.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although that should probably be considered as a bug fix. With master:

In [1]: df = pd.DataFrame({'a': [1, 2, 3]}, index=pd.date_range("2012", periods=3).astype(object))

In [2]: df.index
Out[2]: Index([2012-01-01 00:00:00, 2012-01-02 00:00:00, 2012-01-03 00:00:00], dtype='object')

In [3]: df['a'].index
Out[3]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq=None)

where the index of the column-as-Series becomes different ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we deprecated this, then un-deprecated bc we didn't have good warning handling/suppression, and were going to try to re-deprecate before long

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a reference (issue/PR) for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is #36697, but that's for the non-fastpath way, I think? While here it's specifically the fastpath that I am changing (which is not used when a user creates a Series directly themselves).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we did that i dont remember any distinction being made between fastpath vs non-fastpath

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 11, 2021
@mroeschke
Copy link
Member

Appears this PR has been dormant for a while closing to clear the queue. Feel free to reopen when you have time to work on this PR and address the comment.

@mroeschke mroeschke closed this Oct 31, 2021
@mroeschke mroeschke removed the Stale label Nov 15, 2021
@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

@jorisvandenbossche not averse to these PRs but can you merge master & address any comments esp on the older ones

@jbrockmendel
Copy link
Member

Current thought: the if labels._is_all_dates is a PITA that we should rip out in 2.0 (we tried deprecating before and it was messy). Could we just avoid going through these paths by using _from_mgr in the relevant places?

@jorisvandenbossche
Copy link
Member Author

Could we just avoid going through these paths by using _from_mgr in the relevant places?

I suppose you mean as alternative to passing fastpath=True, right?

First, the Series(..) fastpath still sets a name as well (which is not included in _from_mgr, so that would make replacing it more cumbersome, or a name keyword could be added to _from_mgr of course)

Looking at where fastpath=True is currently used, this is limited to:

  • DataFrame._box_col_values (for constructing a Series for a DataFrame column) -> this uses _constructor_sliced, which needs to be called directly (we can't easily do _constructor_sliced._from_mgr for backwards compatibility, since _constructor is only required to be a callable, not necessarily a class)
  • DataFrame._series attribute (only used in testing?)
  • Series.take -> also uses _constructor (so similar issue as above)

So from that, I don't think it is super easy to change fastpath=True to _from_mgr

@jorisvandenbossche
Copy link
Member Author

BTW, I still need to add a test case for this (the output is on master, with this PR both will be object dtype):

In [1]: df = pd.DataFrame({'a': [1, 2, 3]}, index=pd.date_range("2012", periods=3).astype(object))

In [2]: df.index
Out[2]: Index([2012-01-01 00:00:00, 2012-01-02 00:00:00, 2012-01-03 00:00:00], dtype='object')

In [3]: df['a'].index
Out[3]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq=None)

And we need to decide if we see that as a bug fix.

@jbrockmendel
Copy link
Member

And we need to decide if we see that as a bug fix.

definitely looks like a bugfix, nice catch

@jreback
Copy link
Contributor

jreback commented Dec 5, 2021

can you rebase i think we merged some parts of this elsewhere

@jorisvandenbossche
Copy link
Member Author

i think we merged some parts of this elsewhere

Any more specific pointer?

@jorisvandenbossche jorisvandenbossche changed the title PERF: improve perf of Series fastpath constructor BUG: preserve object-dtype index when accessing DataFrame columns / PERF: improve perf of Series fastpath constructor Dec 6, 2021
@jorisvandenbossche jorisvandenbossche changed the title BUG: preserve object-dtype index when accessing DataFrame columns / PERF: improve perf of Series fastpath constructor BUG: preserve object-dtype index when accessing DataFrame column / PERF: improve perf of Series fastpath constructor Dec 6, 2021
Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM cc @jreback

@jbrockmendel
Copy link
Member

@jorisvandenbossche can you rebase

@jorisvandenbossche jorisvandenbossche added this to the 1.5 milestone Jan 21, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls also rebase

@@ -869,6 +869,7 @@ Indexing
- Bug in :meth:`DataFrame.loc.__setitem__` changing dtype when indexer was completely ``False`` (:issue:`37550`)
- Bug in :meth:`IntervalIndex.get_indexer_non_unique` returning boolean mask instead of array of integers for a non unique and non monotonic index (:issue:`44084`)
- Bug in :meth:`IntervalIndex.get_indexer_non_unique` not handling targets of ``dtype`` 'object' with NaNs correctly (:issue:`44482`)
- Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (:issue:`42950`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 1.5

@jorisvandenbossche jorisvandenbossche merged commit 12475e0 into pandas-dev:main Jan 24, 2022
@jorisvandenbossche jorisvandenbossche deleted the perf-series-constructor-fastpath branch January 24, 2022 09:51
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants