Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support mask in groupby sum #48018

Merged
merged 4 commits into from
Aug 11, 2022
Merged

Conversation

phofl
Copy link
Member

@phofl phofl commented Aug 9, 2022

cc @jorisvandenbossche

@phofl phofl added Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Aug 9, 2022
and is_datetimelike
and val == <float64_t>NPY_NAT
and val == NPY_NAT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should be part of isna_entry

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, makes way more sense. Thanks

@phofl phofl mentioned this pull request Aug 10, 2022
5 tasks
@mroeschke mroeschke added this to the 1.5 milestone Aug 11, 2022
@phofl
Copy link
Member Author

phofl commented Aug 11, 2022

Are you ok with merging this? Joris is linked for notification purposes

@mroeschke mroeschke merged commit 726994e into pandas-dev:main Aug 11, 2022
@mroeschke
Copy link
Member

Are you ok with merging this? Joris is linked for notification purposes

Sure, we can have a followup PR if any are needed.

@jorisvandenbossche
Copy link
Member

Not strictly specific to this PR (it's already existing behaviour), but noticed from looking at the changes here: the resulting out has the same type as the input values (both typed as the fused sum_t).
However, for a plain sum (not grouped sum), we (or numpy) always output (u)int64 for any integer input dtype, in contrast to float32 and float64 which keep there precision in sum. So now that we introduce integer support for the grouped sum (in addition to float), we should maybe also consider making this consistent with Series.sum behaviour?

(I suppose right now (before this PR), after going through float in the group_sum algo, we cast back to the original dtype, so preserving the input dtype)

@jorisvandenbossche
Copy link
Member

(the same also applies to prod #48027)

@phofl
Copy link
Member Author

phofl commented Aug 11, 2022

This is indeed a change in behavior, currently we are casting int8 to float, if they are to large. This pr causes a overflow. Will fix this in a follow up. We can simply use int64 or uint64 as out dtype and try to cast back if possible afterwards

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 11, 2022

Ah, yes, it's indeed a behaviour change that it can now overflow (because of not casting to float before the algo). We currently try to cast back, and that can still give an overflow. We currently warn about that (edit the example below is actually warning in the Series constructor because I used too large values ...):

In [8]: pd.Series([150, 150, 3, 100], dtype="int8").groupby([0, 0, 1, 1]).sum()
<ipython-input-8-03db26e15165>:1: FutureWarning: Values are too large to be losslessly cast to int8. In a future version this will raise OverflowError. To retain the old behavior, use pd.Series(values).astype(int8)
  pd.Series([150, 150, 3, 100], dtype="int8").groupby([0, 0, 1, 1]).sum()
Out[8]: 
0   -212.0
1    103.0
dtype: float64

so in addition to using (u)int64 as out dtype inside the group_sum algo, I think we should also consider to not cast back (we also don't do that for Series.sum)

@phofl
Copy link
Member Author

phofl commented Aug 11, 2022

Just that I understand you correctly, you don’t want to cast back at all , independently of overflow issues?

don’t have a strong opinion. Just something to consider: the output of a sum might be considerably smaller than from a groupby operation, this might be important when considering the memory footprint. But I would be open to not casting back in general, just wanted to mention this

@phofl phofl deleted the groupby_sum_mask branch August 11, 2022 19:26
@phofl
Copy link
Member Author

phofl commented Aug 11, 2022

Independently of this, this is also an issue for cumsum

@jorisvandenbossche
Copy link
Member

Yes, it is certainly true that because of the grouping, you might less easily run into overflow. Although with sufficiently large data / few large groups, I think in practice people can also easily get that.
And it's also true that with a plain sum, you get a scalar result (for which memory typically won't matter that much), while for groupby you have a full column, for which the data type might be more important.

Now, this is an existing issue. Before this PR (+ #48059), we tried to cast back the float values to original dtype (and kept float if not possible), and not we cast back the (u)int64 values to the original dtype (and keep the (u)int64 if not possible). So at least that is already an improvement, and I can still open an issue about the "try to cast back" in general.

@phofl
Copy link
Member Author

phofl commented Aug 12, 2022

Yes agreed. We are better off now than before. As mentioned above I am not opposed to avoid casting back. But this is a bit out of scope here, so I think a new issue is a good idea.

YYYasin19 pushed a commit to YYYasin19/pandas that referenced this pull request Aug 23, 2022
* ENH: Support mask in groupby sum

* ENH: Support mask in groupby sum

* Fix mypy

* Refactor if condition
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* ENH: Support mask in groupby sum

* ENH: Support mask in groupby sum

* Fix mypy

* Refactor if condition
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants