-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support mask in groupby sum #48018
Conversation
pandas/_libs/groupby.pyx
Outdated
and is_datetimelike | ||
and val == <float64_t>NPY_NAT | ||
and val == NPY_NAT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this should be part of isna_entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, makes way more sense. Thanks
Are you ok with merging this? Joris is linked for notification purposes |
Sure, we can have a followup PR if any are needed. |
Not strictly specific to this PR (it's already existing behaviour), but noticed from looking at the changes here: the resulting (I suppose right now (before this PR), after going through float in the group_sum algo, we cast back to the original dtype, so preserving the input dtype) |
(the same also applies to |
This is indeed a change in behavior, currently we are casting int8 to float, if they are to large. This pr causes a overflow. Will fix this in a follow up. We can simply use int64 or uint64 as out dtype and try to cast back if possible afterwards |
Ah, yes, it's indeed a behaviour change that it can now overflow (because of not casting to float before the algo). We currently try to cast back, and that can still give an overflow.
so in addition to using (u)int64 as |
Just that I understand you correctly, you don’t want to cast back at all , independently of overflow issues? don’t have a strong opinion. Just something to consider: the output of a sum might be considerably smaller than from a groupby operation, this might be important when considering the memory footprint. But I would be open to not casting back in general, just wanted to mention this |
Independently of this, this is also an issue for cumsum |
Yes, it is certainly true that because of the grouping, you might less easily run into overflow. Although with sufficiently large data / few large groups, I think in practice people can also easily get that. Now, this is an existing issue. Before this PR (+ #48059), we tried to cast back the float values to original dtype (and kept float if not possible), and not we cast back the (u)int64 values to the original dtype (and keep the (u)int64 if not possible). So at least that is already an improvement, and I can still open an issue about the "try to cast back" in general. |
Yes agreed. We are better off now than before. As mentioned above I am not opposed to avoid casting back. But this is a bit out of scope here, so I think a new issue is a good idea. |
* ENH: Support mask in groupby sum * ENH: Support mask in groupby sum * Fix mypy * Refactor if condition
* ENH: Support mask in groupby sum * ENH: Support mask in groupby sum * Fix mypy * Refactor if condition
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.cc @jorisvandenbossche