-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF/BUG: use masked algo in groupby cummin and cummax #40651
PERF/BUG: use masked algo in groupby cummin and cummax #40651
Conversation
im unclear on what this means. can you rephrase? |
Sorry about that definitely confusing, updated comment |
["cummin", "cummax"], | ||
] | ||
|
||
def setup(self, dtype, method): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this costly? worth using setup_cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks to take ~0.25s. So might be worth caching, but appears setup_cache
can't be parameterized, so would have to ugly up the benchmark a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just make N // 10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep have shrunk benchmark
val_is_nan = True | ||
out[i, j] = val | ||
|
||
if not val_is_nan: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to implement this as a separate function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion about this. The question would be the tradeoff between a bit more complexity/branching vs duplication/increased package size (if we end up adding masked support to a lot more of these grouped algos)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any guess what the impact on package size is?
potential duplication might be addressed by making e.g. refactoring L1340-1347 or L 1302-1308 into helper functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on rough estimate, binaries generated from groupby.pyx take up ~5% of total _libs
. So based on the figure of _libs
taking up 17MB from #30741, cost of full duplication would be around 0.8-0.9 MB. But like you mentioned above, some duplication could be avoided, 1 MB should be an upper bound.
["cummin", "cummax"], | ||
] | ||
|
||
def setup(self, dtype, method): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just make N // 10
pandas/_libs/groupby.pyx
Outdated
labels : np.ndarray[np.intp] | ||
Labels to group by. | ||
ngroups : int | ||
Number of groups, larger than all entries of `labels`. | ||
is_datetimelike : bool | ||
True if `values` contains datetime-like entries. | ||
use_mask : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this actually worth it? we don't do this anywhere else. can you show head to head where you pass this always as True (vs your change method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to allow mask
to be None
, so use_mask
can be defined inside the function as use_mask = mask is not None
(a similar approach is used hashtable_class_helper.pxi for the unique/factorize implementations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, definitely a cleaner solution.
Addressing your comment @jreback, will take a look back at perf diff with forcing mask usage. IIRC the main reason for the optional mask was a slowdown in float
with the cost of isna
along with mask lookups in the algo ending up larger than existing null checking in cython.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem like as much of a slowdown as feared, ASV's with forced mask usage (from 09b73fc)
before after ratio
[e69df38c] [293dc6ea]
<master> <not_optional_mask>
324±8ms 277±10ms ~0.86 groupby.CumminMax.time_frame_transform('Float64', 'cummax')
326±10ms 294±10ms ~0.90 groupby.CumminMax.time_frame_transform('Float64', 'cummin')
- 692±8ms 393±10ms 0.57 groupby.CumminMax.time_frame_transform('Int64', 'cummax')
- 688±20ms 387±7ms 0.56 groupby.CumminMax.time_frame_transform('Int64', 'cummin')
368±7ms 372±20ms 1.01 groupby.CumminMax.time_frame_transform('float64', 'cummax')
382±3ms 380±20ms 0.99 groupby.CumminMax.time_frame_transform('float64', 'cummin')
- 598±5ms 536±3ms 0.90 groupby.CumminMax.time_frame_transform('int64', 'cummax')
606±4ms 575±30ms 0.95 groupby.CumminMax.time_frame_transform('int64', 'cummin')
- 645±6ms 266±6ms 0.41 groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummax')
- 643±10ms 275±10ms 0.43 groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummin')
- 734±7ms 259±6ms 0.35 groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummax')
- 752±5ms 251±9ms 0.33 groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummin')
364±6ms 382±10ms 1.05 groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummax')
362±7ms 378±8ms 1.04 groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummin')
1.14±0.02s 1.13±0.02s 0.99 groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummax')
1.14±0.02s 1.12±0.03s 0.98 groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummin')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this forced mask solution is cleaner because along with the simpler cython algo, it would eventually allow cleaning up some of the kludge around NaT/datetimelike. Which can lose information on edge cases with forced casts to float from
pandas/pandas/core/groupby/ops.py
Lines 673 to 677 in 4554635
elif is_integer_dtype(dtype): | |
# we use iNaT for the missing value on ints | |
# so pre-convert to guard this condition | |
if (values == iNaT).any(): | |
values = ensure_float64(values) |
pandas/core/groupby/ops.py
Outdated
mask = isna(values).copy() | ||
arr = values._data | ||
|
||
if is_integer_dtype(values.dtype) or is_bool_dtype(values.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this an entirely different funtion? pls integrate to existing infra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a different function because it's specific for MaskedArrays. Having this as a separate function is consistent with how it's currently implemented IMO (with a similar separate function for generic EAs).
It could also be another elif
check in _ea_wrap_cython_operation
, but it's not that it would result in less code or so, and since _ea_wrap_cython_operation
already gets quite complicated, I think this separate function is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an initial version I tried to fold this into _ea_wrap_cython_operation
, but thought this smaller function was a cleaner solution since the conditionals here can remain much simpler. While adding a function for just supporting masked cummax
, cummin
seems wasteful, this infrastructure should extend to more masked groupby algos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments. This MUST integrate with the existing infrastructure (or refactor that). Duplicating is -1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The whole point of this PR is to add a special handling for masked arrays, to ensure to pass through the mask to the groupby cython algo. This will always add some code (and the code in
_masked_ea_wrap_cython_operation
is specific to masked arrays). - IMO this is integrated with the existing infrastructure: it integrates nicely into the existing
_cython_operation
and follows the same pattern as we already have for_ea_wrap_cython_operation
If you don't like how the added code is structured right now, please do a concrete suggestion of how you would do it differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mzeitlin11 thanks for looking into this!
pandas/_libs/groupby.pyx
Outdated
labels : np.ndarray[np.intp] | ||
Labels to group by. | ||
ngroups : int | ||
Number of groups, larger than all entries of `labels`. | ||
is_datetimelike : bool | ||
True if `values` contains datetime-like entries. | ||
use_mask : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to allow mask
to be None
, so use_mask
can be defined inside the function as use_mask = mask is not None
(a similar approach is used hashtable_class_helper.pxi for the unique/factorize implementations)
pandas/core/groupby/ops.py
Outdated
mask = isna(values).copy() | ||
arr = values._data | ||
|
||
if is_integer_dtype(values.dtype) or is_bool_dtype(values.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a different function because it's specific for MaskedArrays. Having this as a separate function is consistent with how it's currently implemented IMO (with a similar separate function for generic EAs).
It could also be another elif
check in _ea_wrap_cython_operation
, but it's not that it would result in less code or so, and since _ea_wrap_cython_operation
already gets quite complicated, I think this separate function is good.
Not directly related to this PR, but something that I noticed in the implementation while reviewing this: the cummin/cummax "propagate" NAs:
But since this is a cumulative maximum, that doesn't make much sense to me (the third value is the max of [1.0, 3.0, NaN], which should be 3.0 in our skipna=True logic).
Does somebody know if this is an explicit design? Or whether this was discussed before? |
@jreback can you take another look at this? |
@jbrockmendel we have the intention, though. Our roadmap says on that topic: "We want to eventually make the new semantics the default." (at the level of abstraction of the "semantics", but currently for numeric dtypes that in practice means using masks) |
Don't want to take up any more reviewer time just yet - want to take a quick crack at seeing what #40651 (comment) might look like first. Wouldn't want everyone to spend a bunch of time reviewing this, then have it end up being made obsolete by a more general solution. |
(I personally think it makes more sense to merge this first, and then have a follow-up PR to refactor how things are called/passed in groupb/ops.py for masked arrays, focusing specifically on that without the cummin/cummax changes here. But leave it to @jbrockmendel and @jreback to state their preference) |
yep onboard here to merge this (needs a rebase for some conflicts). |
ditto |
have rebased |
pandas/core/groupby/ops.py
Outdated
|
||
if is_integer_dtype(arr.dtype) or is_bool_dtype(arr.dtype): | ||
# IntegerArray or BooleanArray | ||
arr = ensure_int_or_float(arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW im planning to kill off this function; for EAs this is always just arr.to_numpy(dtype="float64", na_value=np.nan)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing up - realized this whole condition can be simplified since we actually have an ndarray at this point
can you merge master once more (small conflict), but want to have CI run. |
thanks @mzeitlin11 happy to take followups on cleaning |
Potential first step towards #37493
ASV's (on
N = 5_000_000
instead ofN = 1_000_000
on the added benchmark to get more stable results):