
Behavior of reductions on empty arrays should be specified #232

Closed · rgommers opened this issue Jul 21, 2021 · 8 comments · Fixed by #262

Comments

@rgommers (Member)

Following up on pytorch/pytorch#61901 (comment). A design rule that NumPy, PyTorch and JAX seem to follow is:

Return identity if reduction has one, otherwise return nan for reductions that promote integers to floating point dtypes and raise an error for all other.

This is not specified anywhere, and I can't find a previous discussion on it. For element-wise functions there's a "Special Cases" section, but for reductions there isn't. @asmeurer does the test suite cover empty array input to functions at all?
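For reference, a quick sketch of how these three cases play out in NumPy today (the exact warning and error messages may vary by version):

import numpy as np

x = np.array([], dtype=np.float64)

np.sum(x)   # 0.0  -- identity of sum
np.prod(x)  # 1.0  -- identity of prod
np.mean(x)  # nan  -- no identity, but float output (emits a RuntimeWarning)
np.max(x)   # raises ValueError (no identity for a zero-size reduction)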

@seberg (Contributor) commented Jul 21, 2021

Return identity if reduction has one, otherwise return nan for reductions that promote integers to floating point dtypes and raise an error for all other.

The note on integers is correct, but gets the logic backwards IMO: The reason for the float promotion is that this is a reduce-like operation (not a proper reduction) which includes a true-division in the definition. E.g. mean is:

sum(x) / len(x)  # or similar

Since sum([]) == 0, len([]) == 0, and 0 / 0 == NaN are all clearly well defined, the result is also well defined. Of course, std with correction may divide e.g. by -1 instead of 0 if unchecked; but it is probably very straightforward to extrapolate.

One problem is that 0 / 0 == NaN works, but integer division 0 // 0 should probably just raise an error.
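A minimal NumPy illustration of that reduce-like definition on an empty array (the errstate guard only silences the expected invalid-value warning):

import numpy as np

x = np.empty(0, dtype=np.float64)

# mean([]) reduces to sum([]) / len([]) == 0.0 / 0, which IEEE 754 defines
# as NaN; NumPy would normally warn about the invalid division.
with np.errstate(invalid="ignore"):
    result = np.sum(x) / x.size

print(result)  # nan
# The integer analogue would be 0 // 0, which has no well-defined value --
# hence the suggestion that it should raise instead.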

These reduce-likes cannot have an identity/initial value, only a default that usually follows cleanly from their definition. The behaviour in that case should clearly follow the behaviour for dtype=integer, i.e. that of integer floor_divide.

I believe there was an argument/request for the NaN return in median, probably saying that median should match mean for ≤2 points (including 0). But, I do not remember and opinions might vary. Note that quantile does not have this behaviour.

@asmeurer (Member)

Assuming all the reduction functions being referred to here are in the "statistical functions" section of the spec, I don't have any tests for them written yet in the test suite, beyond the basic signature tests that apply to every function. I agree that the reductions should either explicitly state what they do to empty arrays or else indicate that they are only specified for nonempty arrays. Right now, it is underspecified. I would also note that the statistical functions don't mention the allowed data types of their inputs (presumably they should all be either "numerical" or "floating-point").

Also, I noticed that NumPy doesn't define min and max for empty arrays, but couldn't they return inf and -inf, respectively, if the input array has a floating-point dtype? I suppose they could return the max/min representable integer for integer dtypes as well. But perhaps this is a sign that we should leave min and max on empty arrays unspecified in the spec.
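To make that suggestion concrete, a hedged sketch of what returning the identity for an empty max could look like (illustrative only, with a hypothetical helper name; NumPy itself raises here):

import numpy as np

def max_with_identity(x: np.ndarray):
    # Hypothetical behavior sketched above, not NumPy's actual max().
    if x.size == 0:
        if np.issubdtype(x.dtype, np.floating):
            return x.dtype.type(-np.inf)   # identity of max for floats
        return np.iinfo(x.dtype).min       # smallest representable integer
    return np.max(x)

max_with_identity(np.empty(0, dtype=np.float64))  # -inf
max_with_identity(np.empty(0, dtype=np.int32))    # -2147483648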

asmeurer added a commit to asmeurer/array-api that referenced this issue Jul 21, 2021
Functions like mean() that perform divisions require floating-point inputs.
All other functions require numeric inputs.

This was mentioned in data-apis#232 but does not fix the main question of that issue
(the behavior of the statistical functions on empty arrays). I can implement
that as well if we come to an agreement about what it should say.
@asmeurer (Member)

I added input dtype restrictions at #234. Please see if they make sense. If we agree on behavior for empty inputs, I can add that to that PR as well.

@lezcano (Contributor) commented Jul 22, 2021

On identities and reductions:

Something that we came across in PyTorch when looking into these things is how annoying this is for min and max. For example, for max, the mathematical identity element is -inf. Now, this poses two problems:

  • It may be unexpected for the user to get an output filled with infs
  • We have infs for floating point numbers, but not for integers. In this case, the identity would be the minimum (or maximum) representable number. This is even more counterintuitive.

After some discussion, we opted for not supporting these reductions over dimensions of length zero. The whole discussion starts in this comment: pytorch/pytorch#52565 (comment)

@rgommers (Member, Author)

The note on integers is correct, but gets the logic backwards IMO: The reason for the float promotion is that this is a reduce-like operation (not a proper reduction) which includes a true-division in the definition.

I agree with your explanation for why there's the promote-to-float, but that doesn't mean it's backwards. Design should go in this order, I think (a rough sketch in code follows the list):

  1. Is there an identity? If so, return that.
  2. Does it return floating-point output for integer input dtypes? (Yes for mean, because of the true division in its definition.)
  3. If it does return a floating point dtype, return nan (the only reasonable answer if there's no identity). If it returns an integer dtype, raise an exception.
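A rough sketch of those three rules in code (hypothetical helper, not part of the spec or any library):

import math

def empty_reduction_result(op: str, output_is_float: bool):
    """Hypothetical encoding of rules 1-3 above."""
    identities = {"sum": 0, "prod": 1}
    if op in identities:        # rule 1: the reduction has an identity
        return identities[op]
    if output_is_float:         # rules 2-3: float output, e.g. mean
        return math.nan
    # rule 3: integer output dtype and no identity
    raise ValueError(f"{op}() is not defined for empty integer input")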

I believe there was an argument/request for the NaN return in median, probably saying that median should match mean for ≤2 points (including 0). But, I do not remember and opinions might vary. Note that quantile does not have this behaviour.

That's a pretty ad-hoc reason. I'd rather have a more consistent set of rules. Part of the problem here seems to be with how median is defined (mathematically) with the midpoint rule:

if has_even_length(input):
    return (left_median + right_median) / 2  # float dtype
else:
    return median_val  # same dtype as input

So the output dtype depends on the shape of the input - that seems like a code smell. Although in NumPy the docs say:

"A new array holding the result. If the input contains integers or floats smaller than float64, then the output data-type is np.float64."

So we thought of this in the past, and always return a floating-point dtype:

>>> x = np.array([1, 2, 3], dtype=np.int64)
>>> np.median(x)
2.0

Which is nicer - but the price we pay for the midpoint rule is that the definition for odd length is unexpected - 2 is the median value, but we return np.float64(2). An alternative would have been to use // in the midpoint rule, to preserve dtype.
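A hedged sketch of that // alternative for integer inputs (hypothetical, and not what np.median actually does):

import numpy as np

def median_keep_int_dtype(x: np.ndarray):
    # Hypothetical midpoint rule using floor division so an integer input
    # keeps an integer output dtype; NumPy's median always returns float.
    s = np.sort(x)
    n = s.size
    if n % 2 == 0:
        return (s[n // 2 - 1] + s[n // 2]) // 2
    return s[n // 2]

median_keep_int_dtype(np.array([1, 2, 3], dtype=np.int64))     # 2 (int64)
median_keep_int_dtype(np.array([1, 2, 3, 4], dtype=np.int64))  # 2 (int64)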

Either of those options seems okay, still consistent with the initial design rules I posted above. I guess those should be extended with a higher-level design rule that we've stated elsewhere: output dtype must be predictable from input dtypes.

@asmeurer (Member)

Just to be clear, median is not currently part of the spec (see #10 (comment)).

@asmeurer (Member)

Is it at least not controversial that empty sum and prod should give 0 and 1? Perhaps the remaining cases should be unspecified (library dependent).

@kgryte (Contributor) commented Jul 26, 2021

Agreed. sum should be 0 and prod should be 1, and the spec should state as much.
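If that lands in the spec, the corresponding test-suite check could look roughly like this (a sketch with NumPy as a stand-in namespace; not the actual array-api-tests code):

import numpy as np  # stand-in for any conforming array API namespace

def test_empty_sum_and_prod(xp=np):
    # sum([]) is the additive identity and prod([]) the multiplicative
    # identity, regardless of the (numeric) input dtype.
    for dtype in (xp.int64, xp.float64):
        x = xp.empty(0, dtype=dtype)
        assert xp.sum(x) == 0
        assert xp.prod(x) == 1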
