ENH: Support mask in groupby sum #48018

phofl · 2022-08-09T21:12:19Z

xref ENH: support masked arrays in groupby cython algos #37493
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
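
For readers unfamiliar with the masked approach, the idea behind "support mask in groupby sum" can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the Cython code in pandas/_libs/groupby.pyx: the helper name masked_group_sum_sketch and its signature are hypothetical, and missing values are assumed to be flagged by a separate boolean mask rather than by a sentinel such as NaN.

```python
import numpy as np


def masked_group_sum_sketch(values, mask, labels, ngroups):
    """Hypothetical helper: per-group sums that skip entries flagged in `mask`."""
    sums = np.zeros(ngroups, dtype=values.dtype)
    counts = np.zeros(ngroups, dtype=np.int64)
    for val, is_na, lab in zip(values, mask, labels):
        # A negative label means "no group"; a True mask entry means "missing".
        if lab < 0 or is_na:
            continue
        sums[lab] += val
        counts[lab] += 1
    return sums, counts


# The 99 is an arbitrary placeholder hidden by the mask, so it must not be summed.
values = np.array([1, 99, 3, 4], dtype=np.int64)
mask = np.array([False, True, False, False])
labels = np.array([0, 0, 1, 1])

sums, counts = masked_group_sum_sketch(values, mask, labels, ngroups=2)
print(sums.tolist(), counts.tolist())
```

Because missing entries are tracked in the mask, the integer values never have to take a round trip through float to represent NA.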

jbrockmendel · 2022-08-09T22:35:08Z

pandas/_libs/groupby.pyx

                        and is_datetimelike
-                        and val == <float64_t>NPY_NAT
+                        and val == NPY_NAT


I think this should be part of isna_entry.

Good point, that makes way more sense. Thanks.


phofl · 2022-08-11T17:23:02Z

Are you ok with merging this? Joris is linked for notification purposes

mroeschke · 2022-08-11T17:24:13Z

> Are you ok with merging this? Joris is linked for notification purposes

Sure, we can have a followup PR if any are needed.

jorisvandenbossche · 2022-08-11T18:24:18Z

Not strictly specific to this PR (it's already existing behaviour), but I noticed this while looking at the changes here: the resulting out has the same type as the input values (both are typed as the fused sum_t).
However, for a plain sum (not a grouped sum), we (or numpy) always output (u)int64 for any integer input dtype, in contrast to float32 and float64, which keep their precision in sum. So now that we are introducing integer support for the grouped sum (in addition to float), we should maybe also consider making this consistent with the Series.sum behaviour?

(I suppose right now (before this PR), after going through float in the group_sum algo, we cast back to the original dtype, so preserving the input dtype)
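
To make the comparison concrete, here is a small sketch of the plain (non-grouped) behaviour described above. The variable names are illustrative, and the exact integer width of the result may depend on platform and pandas version; the point is only that the integer total is widened while the float32 total is not.

```python
import pandas as pd

small_ints = pd.Series([100, 100, 27], dtype="int8")
small_floats = pd.Series([1.5, 2.5], dtype="float32")

# The true total, 227, does not fit in int8 (max 127); the plain sum
# accumulates in a wider integer type, so it does not overflow.
int_total = small_ints.sum()

# The float32 total keeps the input precision.
float_total = small_floats.sum()

print(int_total, int_total.dtype)
print(float_total, float_total.dtype)
```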

jorisvandenbossche · 2022-08-11T18:25:21Z

(the same also applies to prod #48027)

phofl · 2022-08-11T18:49:39Z

This is indeed a change in behavior: currently we cast int8 to float if they are too large, whereas this PR causes an overflow. I will fix this in a follow-up. We can simply use int64 or uint64 as the out dtype and try to cast back afterwards if possible.
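
The proposal above (accumulate in a wide integer dtype, then cast back only when that is lossless) can be sketched roughly as follows. This is a NumPy-only illustration, not the pandas implementation: group_sum_sketch and cast_back_if_lossless are hypothetical helpers, and np.add.at stands in for the Cython loop.

```python
import numpy as np


def group_sum_sketch(values, labels, ngroups, out_dtype):
    """Hypothetical helper: per-group sums accumulated in `out_dtype`."""
    out = np.zeros(ngroups, dtype=out_dtype)
    # np.add.at adds unbuffered and in place, so repeated labels accumulate.
    np.add.at(out, labels, values)
    return out


def cast_back_if_lossless(result, orig_dtype):
    """Return `result` as `orig_dtype` only if every value fits; else unchanged."""
    info = np.iinfo(orig_dtype)
    if result.min() >= info.min and result.max() <= info.max:
        return result.astype(orig_dtype)
    return result


values = np.array([100, 100, 3, 100], dtype=np.int8)
labels = np.array([0, 0, 1, 1])

# Accumulating in int8: group 0 should be 200, which int8 cannot hold,
# so the stored value wraps around instead.
narrow = group_sum_sketch(values, labels, 2, np.int8)

# Accumulating in int64 gives the correct totals.
wide = group_sum_sketch(values, labels, 2, np.int64)

# 200 does not fit in int8, so the result stays int64 here.
final = cast_back_if_lossless(wide, np.int8)

print(narrow, wide, final.dtype)
```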

jorisvandenbossche · 2022-08-11T18:53:56Z

Ah, yes, it's indeed a behaviour change that it can now overflow (because of not casting to float before the algo). We currently try to cast back, and that can still give an overflow. ~~We currently warn about that~~ (edit: the example below is actually warning in the Series constructor because I used values that are too large ...):

In [8]: pd.Series([150, 150, 3, 100], dtype="int8").groupby([0, 0, 1, 1]).sum()
<ipython-input-8-03db26e15165>:1: FutureWarning: Values are too large to be losslessly cast to int8. In a future version this will raise OverflowError. To retain the old behavior, use pd.Series(values).astype(int8)
  pd.Series([150, 150, 3, 100], dtype="int8").groupby([0, 0, 1, 1]).sum()
Out[8]: 
0   -212.0
1    103.0
dtype: float64

So in addition to using (u)int64 as the out dtype inside the group_sum algo, I think we should also consider not casting back (we don't do that for Series.sum either).

phofl · 2022-08-11T19:13:26Z

Just so that I understand you correctly: you don't want to cast back at all, independently of overflow issues?

I don't have a strong opinion. Just something to consider: the output of a sum might be considerably smaller than the output of a groupby operation, which might be important when considering the memory footprint. But I would be open to not casting back in general; I just wanted to mention this.

phofl · 2022-08-11T19:28:22Z

Independently of this, this is also an issue for cumsum

jorisvandenbossche · 2022-08-12T18:25:03Z

Yes, it is certainly true that because of the grouping, you are less likely to run into overflow. Although with sufficiently large data or a few large groups, I think people can easily hit it in practice as well.
And it's also true that with a plain sum you get a scalar result (for which memory typically won't matter that much), while for groupby you get a full column, for which the data type might be more important.
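
As a rough illustration of the memory point (the group count here is an arbitrary assumption), a result column with one entry per group takes eight times as much space as int64 than as int8:

```python
import numpy as np

ngroups = 1_000_000  # arbitrary number of groups, for illustration only

as_int8 = np.zeros(ngroups, dtype=np.int8)
as_int64 = np.zeros(ngroups, dtype=np.int64)

print(as_int8.nbytes)   # 1 byte per group
print(as_int64.nbytes)  # 8 bytes per group
```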

Now, this is an existing issue. Before this PR (+ #48059), we tried to cast the float values back to the original dtype (and kept float if that was not possible), and now we cast the (u)int64 values back to the original dtype (and keep (u)int64 if that is not possible). So at least that is already an improvement, and I can still open an issue about the "try to cast back" behaviour in general.

phofl · 2022-08-12T18:29:09Z

Yes, agreed. We are better off now than before. As mentioned above, I am not opposed to avoiding the cast back, but this is a bit out of scope here, so I think a new issue is a good idea.

* ENH: Support mask in groupby sum
* ENH: Support mask in groupby sum
* Fix mypy
* Refactor if condition

phofl added 2 commits August 9, 2022 21:32

ENH: Support mask in groupby sum

968a959

ENH: Support mask in groupby sum

9184025

phofl added Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Aug 9, 2022

Fix mypy

5df22bc

jbrockmendel reviewed Aug 9, 2022


Refactor if condition

95fde33

phofl mentioned this pull request Aug 10, 2022

ENH: Support masks in groupby prod #48027

Merged

5 tasks

mroeschke added this to the 1.5 milestone Aug 11, 2022

mroeschke reviewed Aug 11, 2022


mroeschke approved these changes Aug 11, 2022


mroeschke merged commit 726994e into pandas-dev:main Aug 11, 2022

phofl deleted the groupby_sum_mask branch August 11, 2022 19:26

phofl mentioned this pull request Aug 11, 2022

REGR: groupby sum causing overflow for int8 #48044

Closed

YYYasin19 pushed a commit to YYYasin19/pandas that referenced this pull request Aug 23, 2022

ENH: Support mask in groupby sum (pandas-dev#48018)

c1d8978


noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

ENH: Support mask in groupby sum (pandas-dev#48018)

5b39d00

