PERF: masked ops for reductions (sum) #30982
Conversation
Looks promising

I updated this to finalize the "sum" case (basically added …)
This looks quite nice. I also think it's worth doing these separately, rather than trying to shoehorn it into our current nanops functions (which I've tried and failed to do a couple times).
pandas/core/ops/mask_ops.py (Outdated)

        Numpy array with the values (can be of any dtype that support the
        operation).
    mask : np.ndarray
        Boolean numpy array (False for missing)
What does "False for missing" mean here?
Hmm, not sure, as it should actually be "True for missing" if it is to explain that the True values in the mask are missing values (maybe I was first planning to pass an inverted mask). Will update that.
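To make the convention concrete, here is a minimal illustration (my own example, not the PR's code): True in the mask marks a missing value, so the valid entries are selected with `~mask`.

```python
import numpy as np

values = np.array([1, 2, 3], dtype="int64")
mask = np.array([False, True, False])  # True marks a missing value

# Sum only the valid (unmasked) entries
result = values[~mask].sum()
print(result)  # 4
```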
More comments?
pandas/core/ops/mask_ops.py (Outdated)

        return np.sum(values, where=~mask)


    def _below_min_count(values, mask, min_count):
We already do this in the nanops sum, yes? Can you de-duplicate?
Yes, it's here:
Lines 1238 to 1271 in 37a7006
    def _maybe_null_out(
        result: np.ndarray,
        axis: Optional[int],
        mask: Optional[np.ndarray],
        shape: Tuple,
        min_count: int = 1,
    ) -> float:
        """
        Returns
        -------
        Dtype
            The product of all elements on a given axis. ( NaNs are treated as 1)
        """
        if mask is not None and axis is not None and getattr(result, "ndim", False):
            null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
            if np.any(null_mask):
                if is_numeric_dtype(result):
                    if np.iscomplexobj(result):
                        result = result.astype("c16")
                    else:
                        result = result.astype("f8")
                    result[null_mask] = np.nan
                else:
                    # GH12941, use None to auto cast null
                    result[null_mask] = None
        elif result is not NaT:
            if mask is not None:
                null_mask = mask.size - mask.sum()
            else:
                null_mask = np.prod(shape)
            if null_mask < min_count:
                result = np.nan
        return result
But I would prefer not to deduplicate, as the one in nanops is mixed with other logic / more complex because it needs to handle more dimensions.
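For comparison, a minimal 1-D sketch of what a dedicated helper could look like (the name and structure are assumptions for illustration, not the actual pandas code): without the multi-axis handling, the check reduces to counting valid values.

```python
import numpy as np

def below_min_count(mask: np.ndarray, min_count: int) -> bool:
    # True if the reduction should return NA because fewer than
    # ``min_count`` valid (unmasked) values are present.
    if min_count <= 0:
        return False
    return bool((mask.size - mask.sum()) < min_count)

mask = np.array([False, True, True])  # only one valid value
print(below_min_count(mask, min_count=2))  # True
print(below_min_count(mask, min_count=1))  # False
```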
Disagree; this is more code to maintain, which keeps happening with ios.
Sorry, what's "with ios"?
pandas/core/ops/mask_ops.py (Outdated)

        if _np_version_under1p17:
            return np.sum(values[~mask])
        else:
            return np.sum(values, where=~mask)
out of curiosity, is this more efficient than the version used in older numpy?
Good question. With the array in my top post it gives only a slight boost (a 5-8% speed-up), but it varies quite a bit (I could also have it slower with more np.nan values).
But when going to larger arrays, there seems to be a clearer difference:
In [40]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]*1_000_000)
In [41]: mask = np.isnan(a)
In [42]: %timeit np.sum(a, where=~mask)
17.7 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [43]: %timeit np.sum(a[~mask])
43.3 ms ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Although the difference gets smaller when there are more missing values (e.g. with 50% missing values instead of 10% in the above, it becomes 25 vs 32 ms).
But I assume there is also a memory benefit (avoiding the temporary array), although I am not fully familiar with the inner workings of this in numpy.
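As a small self-contained check of the equivalence discussed here (my own example; the `where` keyword requires NumPy >= 1.17):

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0])
mask = np.isnan(a)

# Boolean indexing copies the valid values into a temporary array;
# ``where=`` skips masked elements without materializing that copy.
s1 = np.sum(a[~mask])
s2 = np.sum(a, where=~mask)
print(s1, s2)  # 7.0 7.0
```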
@@ -533,13 +533,14 @@ def test_sum_inf(self):
        res = nanops.nansum(arr, axis=1)
        assert np.isinf(res).all()

    @pytest.mark.parametrize("dtype", ["float64", "Int64", "boolean"])
should object be included here?
maybe None?
Added object dtype
pandas/core/ops/mask_ops.py
Outdated
The required number of valid values to perform the operation. If fewer than | ||
``min_count`` non-NA values are present the result will be NA. | ||
""" | ||
if not skipna: |
Most of the code here is not sum-specific. Is the idea to eventually make a template that gets re-used?
Yes, that's possible. I mainly want to first implement a single one, so we can review this / agree on the principles. Afterwards there can be more PRs implementing the other reductions, and then we should indeed see what's the best way to avoid duplication.
Some of the code here looks similar to the min/max methods in DatetimeIndex/TimedeltaIndex (specifically the Indexes, not the EAs) since those use their cached masks. It'd be nice if some of that could be shared
Do you mean this one? pandas/core/indexes/datetimelike.py, lines 239 to 271 in 37a7006
There are indeed some similar concepts, but I am not sure if it is that easy to share code (if you have specific ideas, feel free to share). E.g. the above doesn't need to handle null_count, and in my version I can hardcode pd.NA instead of needing to make this a variable. And it would only be lines 262-269 in the above code that could be replaced with a function call. Maybe if we have a "templated" version (e.g. where you pass the numpy function to call as a variable), as mentioned in one of the other inline comments, this would become easier. But I would personally only look into it when also implementing min/max in mask_ops (sum/prod are a bit special with the min_count).
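A rough sketch of what such a "templated" masked reduction could look like (hypothetical names, with None standing in for pd.NA to keep the example dependency-free):

```python
import numpy as np

def masked_reduction(func, values, mask, skipna=True):
    # ``func`` is the numpy reduction to apply (e.g. np.min, np.max);
    # ``mask`` uses True for missing values.
    if mask.all() or (not skipna and mask.any()):
        return None  # stand-in for pd.NA
    return func(values[~mask])

values = np.array([1, 5, 3])
mask = np.array([False, True, False])
print(masked_reduction(np.min, values, mask))  # 1
print(masked_reduction(np.max, values, mask))  # 3
print(masked_reduction(np.max, values, mask, skipna=False))  # None
```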
@jorisvandenbossche we need to share more code with the very well tested internal functions. Adding more and more exclusively for the extension types just adds more bugs. I would refactor much sooner than later (now) rather than continuing to add things.
So I know this is adding code, but reasons to do this separately:
I could try to do a POC of combining them so we have something to compare with. But even if that is possible without too many hooks, for the above reasons I prefer to keep the "masked ops" separate. Personally, I would (short term) rather put effort in sharing the testing code instead of sharing the actual implementation: improve testing by ensuring we re-use a lot more of our extensive, existing tests for the numeric ops (so run those tests also for the nullable dtypes, instead of testing them separately as we do now). Another way of sharing code could also be the other way around: instead of trying to integrate this into nanops, we could start looking into gradually using those "masked ops" more for some of the existing dtypes. For example, as @jbrockmendel mentioned above, the Datetime/TimedeltaIndex min/max functions could use a masked op version of min/max.
The closest thing I had to a specific idea was "huh, this looks kinda similar to that other thing"
OK, moved, so this is ready for another look
lgtm. Small follow-on request: this needs a whatsnew note?
Does this need ASVs?
Added the Int64/boolean dtype to the parametrization of some existing benchmarks (Any/All/NanOps and FrameOps), and added a short whatsnew notice.
lgtm @jbrockmendel @TomAugspurger if any comments
Don't think I'll have any today. Last I looked (a while ago) I was happy with the overall strategy.
@@ -229,6 +229,8 @@ Performance improvements
  sparse values from ``scipy.sparse`` matrices using the
  :meth:`DataFrame.sparse.from_spmatrix` constructor (:issue:`32821`,
  :issue:`32825`, :issue:`32826`, :issue:`32856`, :issue:`32858`).
- Performance improvement in :meth:`Series.sum` for nullable (integer and boolean) dtypes (:issue:`30982`).
Is DataFrame.sum affected?
No, since we don't use the EA implementation for reductions in DataFrame reductions. That's my other PR.
    from pandas.core.nanops import _below_min_count


    def sum(
can we avoid overriding a built-in name?
> can we avoid overriding a built-in name?

Do we care? In practice this is only used as masked_reductions.sum(..), so there is no risk of confusing it (e.g. numpy also has a sum ..).
I also don't have a problem with renaming it to e.g. masked_sum, though (but all together it then becomes a bit long: masked_reductions.masked_sum(..)).
nansum? Not a strong enough opinion to be a blocker.
There are no NaNs involved here ;), that's exactly the difference with nanops.
ha, sounds good
        Boolean numpy array (True values indicate missing values).
    skipna : bool, default True
        Whether to skip NA.
    min_count : int, default 0
required >=0?
It's a count, so yes
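For context, the user-facing behavior of `min_count` (this uses the public, documented `Series.sum` API; requires pandas installed):

```python
import pandas as pd

s = pd.Series([1, None], dtype="Int64")  # one valid value, one missing

print(s.sum(min_count=1))  # 1
print(s.sum(min_count=2))  # <NA>, since fewer than 2 valid values are present
```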
pandas/core/nanops.py (Outdated)

@@ -1238,7 +1238,7 @@ def _maybe_null_out(
     result: np.ndarray,
     axis: Optional[int],
     mask: Optional[np.ndarray],
-    shape: Tuple,
+    shape: Tuple[int],
Does Tuple[int] describe a 1-tuple, or an any-length tuple?
Ah, yes, apparently that needs to be Tuple[int, ...] for a variable-length tuple.
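A quick illustration of the distinction (annotations are checked by tools like mypy, not enforced at runtime):

```python
from typing import Tuple

one_int: Tuple[int] = (5,)               # exactly one int
any_length: Tuple[int, ...] = (2, 3, 4)  # any number of ints

print(one_int, any_length)
```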
@jbrockmendel merge if you are good with responses
thanks @jorisvandenbossche
And thanks for the reviews!
    if _np_version_under1p17:
        return np.sum(values[~mask])
    else:
        return np.sum(values, where=~mask)
@jorisvandenbossche This seems like a useful pattern for other reductions like min and max (maybe a special case is needed if we have a min_count?), wondering if it might make sense to generalize this to those? Could be more performant than using nanops, and could also be done in a way that's extensible to string dtype (where nanops can't be used).
Yes, that's indeed the goal of this PR (I just wanted to limit it to a single function for the initial PR). For min/max, I think it will be better to write a separate function in the module since they don't use min_count, but prod should be able to reuse this.
I was planning to do follow-up PRs at the end of this week, but feel free to pick it up!
The current nanops has quite some complexity that is not needed for the masked arrays. This is a small proof of concept to have separate implementations for our nullable masked arrays, taking the sum case (and still ignoring the additional kwargs).

This is also quite a bit faster. On master, the nullable Int64 basically does the same as the nanops implementation for float. With this PR, the speed-up is almost 4x with 10k elements; when using bigger arrays the relative speed-up becomes smaller (with 1M elements it's around 2x).

I think it would be interesting to gradually implement some of the ops specifically for the nullable masked arrays. Personally I think it will be easier to do this in new functions than trying to fit it into the existing nanops functions.
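A condensed sketch of the approach described above (names and structure are my own for illustration; the actual implementation in the PR handles more cases):

```python
import numpy as np

def masked_sum(values, mask, skipna=True, min_count=0):
    # ``mask`` uses True for missing values; None stands in for pd.NA.
    if not skipna:
        if mask.any():
            return None
        return np.sum(values)
    if min_count > 0 and (mask.size - mask.sum()) < min_count:
        return None
    return np.sum(values, where=~mask)  # requires NumPy >= 1.17

values = np.array([1, 2, 3, 4], dtype="int64")
mask = np.array([False, False, True, False])
print(masked_sum(values, mask))  # 7
print(masked_sum(values, mask, min_count=4))  # None
```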