PERF: fix some of .clip() performance regression by using numpy arrays where possible #24735
Codecov Report
@@ Coverage Diff @@
## master #24735 +/- ##
==========================================
+ Coverage 92.39% 92.39% +<.01%
==========================================
Files 166 166
Lines 52358 52362 +4
==========================================
+ Hits 48374 48378 +4
Misses 3984 3984
Codecov Report
@@ Coverage Diff @@
## master #24735 +/- ##
==========================================
+ Coverage 92.39% 92.39% +<.01%
==========================================
Files 166 166
Lines 52378 52382 +4
==========================================
+ Hits 48393 48398 +5
+ Misses 3985 3984 -1
pandas/core/generic.py (outdated)
    with np.errstate(all='ignore'):
        if upper is not None:
            subset = self.values <= upper
Curious, does this play nice when self is the new Int64 dtype?
It appears to - let me know if there's a more extensive test you have in mind:
In [2]: s = pd.Series(range(5)).astype('Int64')
In [3]: s.clip(1, 3)
Out[3]:
0    1
1    1
2    2
3    3
4    3
dtype: Int64
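A slightly more extensive check could also include a missing value, since that is where the nullable `Int64` dtype differs from plain `int64`. The following is a minimal sketch, assuming a pandas version recent enough to support `Int64`; the variable names are illustrative and not taken from the PR.

```python
import pandas as pd

# Nullable integer Series with one missing entry.
s = pd.Series([0, 1, None, 3, 4], dtype="Int64")

clipped = s.clip(1, 3)

# The dtype should be preserved and the missing entry should stay missing.
dtype_name = str(clipped.dtype)
missing_positions = clipped.isna().tolist()
present_values = clipped.dropna().tolist()
```

If clipping respects the dtype, `dtype_name` stays `Int64` and the missing position is untouched while its neighbours are clipped into the range.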
    result = result.where(subset, lower, axis=None, inplace=False)
    mask = isna(self.values)

    with np.errstate(all='ignore'):
What's the point of this?
This is simply reverting back to what this block used to do; it's needed in the event values <= upper would otherwise raise a type error.
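For readers unfamiliar with the context manager: as documented by NumPy, `np.errstate` controls how floating-point conditions (divide, over, under, invalid) are reported. The sketch below only illustrates that documented behaviour with a division by zero; it is not the code path from the PR, and the array contents are made up.

```python
import warnings

import numpy as np

values = np.array([1.0, 0.0, -1.0])

with warnings.catch_warnings():
    # Turn any warning into an exception so that a leaked
    # RuntimeWarning would make this block fail loudly.
    warnings.simplefilter("error")
    with np.errstate(all="ignore"):
        # Division by zero yields inf/nan, but no warning is emitted here.
        ratio = values / 0.0
```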
pandas/core/generic.py (outdated)
    with np.errstate(all='ignore'):
        if upper is not None:
            subset = self.values <= upper
Idiomatic approach here would now be to_numpy instead of .values
Updated
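One reason `to_numpy()` is the safer spelling: `.values` may hand back an extension array rather than an ndarray, depending on the dtype. A small sketch using a categorical Series (chosen here only for illustration, it is not part of the PR):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))

# .values returns the underlying extension array for extension dtypes.
via_values = s.values

# .to_numpy() returns a NumPy ndarray.
via_to_numpy = s.to_numpy()
```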
@qwhelan can you update
@qwhelan can you merge master and ping when passing
thanks @qwhelan
A recent change to respect dtypes in .clip() (#24458) introduced a decent overhead of ~2ms to the call.
This PR cuts the overhead from ~2ms to ~0.6ms by keeping subset as a numpy array; it's entirely boolean regardless of underlying dtype, so a DataFrame only adds overhead here.
git diff upstream/master -u -- "*.py" | flake8 --diff
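The idea described above can be sketched roughly as follows. This is a simplified illustration, not the actual pandas implementation: `clip_upper_sketch` is a hypothetical helper, it handles only an upper bound, and it assumes float data.

```python
import numpy as np
import pandas as pd


def clip_upper_sketch(obj, upper):
    """Hypothetical helper: clip `obj` from above using a NumPy boolean mask."""
    values = obj.to_numpy()
    with np.errstate(all="ignore"):
        # Plain boolean ndarray; missing entries are kept as-is
        # rather than being replaced by the bound.
        subset = (values <= upper) | pd.isna(values)
    # Where the mask is False, substitute the bound.
    return obj.where(subset, upper)


df = pd.DataFrame({"a": [1.0, 5.0, np.nan], "b": [4.0, 2.0, 3.5]})
result = clip_upper_sketch(df, 3.0)
```

Because `subset` is built from `to_numpy()`, the comparison never constructs an intermediate DataFrame, which is the overhead the description refers to.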