
Behavior of reductions on empty arrays should be specified #232

Closed · rgommers opened this issue Jul 21, 2021 · 8 comments · Fixed by #262

Comments

@rgommers (Member)

Following up on pytorch/pytorch#61901 (comment). A design rule that NumPy, PyTorch and JAX seem to follow is:

Return identity if reduction has one, otherwise return nan for reductions that promote integers to floating point dtypes and raise an error for all other.

This is not specified anywhere, and I can't find a previous discussion on it. For element-wise functions there's a "Special Cases" section, but for reductions there isn't. @asmeurer does the test suite cover empty array input to functions at all?
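For reference, a quick sketch of how these three cases play out in NumPy today (the exact warning and error messages may vary by version):

import numpy as np

x = np.array([], dtype=np.float64)

np.sum(x)   # 0.0  -- identity of sum
np.prod(x)  # 1.0  -- identity of prod
np.mean(x)  # nan  -- no identity, but float output (emits a RuntimeWarning)
np.max(x)   # raises ValueError (no identity for a zero-size reduction)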

@seberg (Contributor) commented Jul 21, 2021

Return identity if reduction has one, otherwise return nan for reductions that promote integers to floating point dtypes and raise an error for all other.

The note on integers is correct, but gets the logic backwards IMO: The reason for the float promotion is that this is a reduce-like operation (not a proper reduction) which includes a true-division in the definition. E.g. mean is:

sum(x) / len(x)  # or similar

Since sum([]) == 0, len([]) == 0, and 0 / 0 == NaN are all clearly well defined, the result is also well defined. Of course, std with correction may divide e.g. by -1 instead of 0 if unchecked; but it is probably very straightforward to extrapolate.

One problem is that 0 / 0 == NaN works, but integer division 0 // 0 should probably just raise an error.
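A minimal NumPy illustration of that reduce-like definition on an empty array (the errstate guard only silences the expected invalid-value warning):

import numpy as np

x = np.empty(0, dtype=np.float64)

# mean([]) reduces to sum([]) / len([]) == 0.0 / 0, which IEEE 754 defines
# as NaN; NumPy would normally warn about the invalid division.
with np.errstate(invalid="ignore"):
    result = np.sum(x) / x.size

print(result)  # nan
# The integer analogue would be 0 // 0, which has no well-defined value --
# hence the suggestion that it should raise instead.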

These reduce-likes cannot have an identity/initial value, only a default that usually follows cleanly from their definition. The behaviour in that case should clearly follow the behaviour for dtype=integer, i.e. that of integer floor_divide.

I believe there was an argument/request for the NaN return in median, probably saying that median should match mean for ≤2 points (including 0). But, I do not remember and opinions might vary. Note that quantile does not have this behaviour.

@asmeurer (Member)

Assuming all the reduction functions being referred to here are in the "statistical functions" section of the spec, I don't have any tests for them written yet in the test suite, beyond the basic signature tests that apply to every function. I agree that the reductions should either explicitly state what they do to empty arrays or else indicate that they are only specified for nonempty arrays. Right now, it is underspecified. I would also note that the statistical functions don't mention the allowed data types of their inputs (presumably they should all be either "numerical" or "floating-point").

Also, I noticed that NumPy doesn't define min and max for empty arrays, but couldn't they return inf and -inf, respectively, if the input array has a floating-point dtype? I suppose they could return the max/min representable integer for integer dtypes as well. But perhaps this is a sign that we should leave min and max on empty arrays unspecified in the spec.
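To make that suggestion concrete, a hedged sketch of what returning the identity for an empty max could look like (illustrative only, with a hypothetical helper name; NumPy itself raises here):

import numpy as np

def max_with_identity(x: np.ndarray):
    # Hypothetical behavior sketched above, not NumPy's actual max().
    if x.size == 0:
        if np.issubdtype(x.dtype, np.floating):
            return x.dtype.type(-np.inf)   # identity of max for floats
        return np.iinfo(x.dtype).min       # smallest representable integer
    return np.max(x)

max_with_identity(np.empty(0, dtype=np.float64))  # -inf
max_with_identity(np.empty(0, dtype=np.int32))    # -2147483648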

asmeurer added a commit to asmeurer/array-api that referenced this issue Jul 21, 2021
Functions like mean() that perform divisions require floating-point inputs.
All other functions require numeric inputs.

This was mentioned in data-apis#232 but does not fix the main question of that issue
(the behavior of the statistical functions on empty arrays). I can implement
that as well if we come to an agreement about what it should say.
@asmeurer (Member)

I added input dtype restrictions at #234. Please see if they make sense. If we agree on behavior for empty inputs, I can add that to that PR as well.

@lezcano (Contributor) commented Jul 22, 2021

On identities and reductions:

Something that we came across in PyTorch when looking into these things is how annoying this is for min and max. For example, for max, the mathematical identity element is -inf. Now, this poses two problems:

  • It may be unexpected for the user to get an output filled with infs
  • We have infs for floating point numbers, but not for integers. In this case, the identity would be the minimum (or maximum) representable number. This is even more counterintuitive.

After some discussion, we opted for not supporting these reductions over dimensions of length zero. The whole discussion starts in this comment: pytorch/pytorch#52565 (comment)

@rgommers (Member, Author)

The note on integers is correct, but gets the logic backwards IMO: The reason for the float promotion is that this is a reduce-like operation (not a proper reduction) which includes a true-division in the definition.

I agree with your explanation for why there's the promote-to-float, but that doesn't mean it's backwards. Design should go in this order, I think (a rough sketch in code follows the list):

  1. Is there an identity? If so, return that.
  2. Does it return floating-point output for integer input dtypes? (Yes for mean, because of the true division in its definition.)
  3. If it does return a floating point dtype, return nan (the only reasonable answer if there's no identity). If it returns an integer dtype, raise an exception.
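A rough sketch of those three rules in code (hypothetical helper, not part of the spec or any library):

import math

def empty_reduction_result(op: str, output_is_float: bool):
    """Hypothetical encoding of rules 1-3 above."""
    identities = {"sum": 0, "prod": 1}
    if op in identities:        # rule 1: the reduction has an identity
        return identities[op]
    if output_is_float:         # rules 2-3: float output, e.g. mean
        return math.nan
    # rule 3: integer output dtype and no identity
    raise ValueError(f"{op}() is not defined for empty integer input")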

I believe there was an argument/request for the NaN return in median, probably saying that median should match mean for ≤2 points (including 0). But, I do not remember and opinions might vary. Note that quantile does not have this behaviour.

That's a pretty ad-hoc reason. I'd rather have a more consistent set of rules. Part of the problem here seems to be with how median is defined (mathematically) with the midpoint rule:

if has_even_length(input):
    return (left_median + right_median) / 2  # float dtype
else:
    return median_val  # same dtype as input

So the output dtype depends on the shape of the input - that seems like a code smell. Although in NumPy the docs say:

"A new array holding the result. If the input contains integers or floats smaller than float64, then the output data-type is np.float64."

So we thought of this in the past, and always return a floating-point dtype:

>>> x = np.array([1, 2, 3], dtype=np.int64)
>>> np.median(x)
2.0

Which is nicer - but the price we pay for the midpoint rule is that the definition for odd length is unexpected - 2 is the median value, but we return np.float64(2). An alternative would have been to use // in the midpoint rule, to preserve dtype.
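A hedged sketch of that // alternative for integer inputs (hypothetical, and not what np.median actually does):

import numpy as np

def median_keep_int_dtype(x: np.ndarray):
    # Hypothetical midpoint rule using floor division so an integer input
    # keeps an integer output dtype; NumPy's median always returns float.
    s = np.sort(x)
    n = s.size
    if n % 2 == 0:
        return (s[n // 2 - 1] + s[n // 2]) // 2
    return s[n // 2]

median_keep_int_dtype(np.array([1, 2, 3], dtype=np.int64))     # 2 (int64)
median_keep_int_dtype(np.array([1, 2, 3, 4], dtype=np.int64))  # 2 (int64)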

Either of those options seems okay, still consistent with the initial design rules I posted above. I guess those should be extended with a higher-level design rule that we've stated elsewhere: output dtype must be predictable from input dtypes.

@asmeurer (Member)

Just to be clear, median is not currently part of the spec (see #10 (comment)).

@asmeurer (Member)

Is it at least not controversial that empty sum and prod should give 0 and 1? Perhaps the remaining cases should be unspecified (library dependent).

@kgryte (Contributor) commented Jul 26, 2021

Agreed. sum should be 0 and prod should be 1, and the spec should state as much.
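If that lands in the spec, the corresponding test-suite check could look roughly like this (a sketch with NumPy as a stand-in namespace; not the actual array-api-tests code):

import numpy as np  # stand-in for any conforming array API namespace

def test_empty_sum_and_prod(xp=np):
    # sum([]) is the additive identity and prod([]) the multiplicative
    # identity, regardless of the (numeric) input dtype.
    for dtype in (xp.int64, xp.float64):
        x = xp.empty(0, dtype=dtype)
        assert xp.sum(x) == 0
        assert xp.prod(x) == 1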
