Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address polymorphic return value for unique? #212

Closed
rgommers opened this issue Jun 30, 2021 · 0 comments · Fixed by #275
Closed

Address polymorphic return value for unique? #212

rgommers opened this issue Jun 30, 2021 · 0 comments · Fixed by #275
Labels
API change Changes to existing functions or objects in the API.
Milestone

Comments

@rgommers
Copy link
Member

unique was added very early on in gh-25. The review discussions have focused on

There is another issue that was not discussed: unique is the only function whose return type is polymorphic. It can return an array, or a length-2, 3 or 4 tuple of arrays:

unique(x, /, *, return_counts=False, return_index=False, return_inverse=False)

API doc: https://data-apis.org/array-api/latest/API_specification/set_functions.html#unique-x-return-counts-false-return-index-false-return-inverse-false

This issue is painful for, e.g., libraries that have to support it in their JIT compiler. For svd, which previously was polymorphic too, we decided that it was better to introduce svdvals to get rid of the polymorphism. There's even a design principle that says as much in https://data-apis.org/array-api/latest/extensions/linear_algebra_functions.html: "In general, interfaces should avoid polymorphic return values (e.g., returning an array or a namedtuple, dependent on, e.g., an optional keyword argument)."

Given that there are 3 boolean keywords, it's not feasible to split unique into separate functions (there'd be 8 of them). An alternative may be:

  • return a custom object with four named fields: (unique, indices, inverse, counts).
  • no tuple unpacking
  • if data is returned, its type is array
  • for fields where the boolean keyword was False, accessing the field is undefined behavior

Another question that may be relevant is: how often are the boolean keywords each used? The data at https://raw.githubusercontent.com/data-apis/python-record-api/master/data/typing/numpy.py may help there:

@overload
def unique(ar: numpy.ndarray, return_inverse: bool):
    """
    usage.scipy: 14
    usage.skimage: 5
    usage.sklearn: 125
    usage.statsmodels: 18
    usage.xarray: 3
    """
    ...


@overload
def unique(ar: numpy.ndarray):
    """
    usage.dask: 7
    usage.matplotlib: 6
    usage.orange3: 16
    usage.scipy: 39
    usage.seaborn: 5
    usage.skimage: 48
    usage.sklearn: 385
    usage.statsmodels: 45
    usage.xarray: 10
    """
    ...


@overload
def unique(ar: numpy.ndarray, return_inverse: bool, return_counts: bool):
    """
    usage.skimage: 3
    usage.sklearn: 2
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: numpy.ndarray, return_counts: bool):
    """
    usage.orange3: 3
    usage.scipy: 5
    usage.skimage: 8
    usage.sklearn: 1
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: numpy.ndarray, return_index: bool):
    """
    usage.scipy: 5
    usage.skimage: 4
    usage.sklearn: 1
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: List[int], return_counts: bool):
    """
    usage.orange3: 1
    """
    ...


@overload
def unique(ar: xarray.core.dataarray.DataArray):
    """
    usage.xarray: 2
    """
    ...


@overload
def unique(ar: List[numpy.int32]):
    """
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: List[Literal["b", "a"]]):
    """
    usage.sklearn: 2
    usage.statsmodels: 1
    """
    ...


@overload
def unique(ar: numpy.ndarray, return_index: bool, return_inverse: bool):
    """
    usage.matplotlib: 1
    usage.sklearn: 6
    usage.statsmodels: 3
    """
    ...


@overload
def unique(ar: List[Literal["4"]], return_inverse: bool):
    """
    usage.statsmodels: 1
    """
    ...


@overload
def unique(
    ar: Union[
        pandas.core.series.Series,
        numpy.ndarray,
        pandas.core.arrays.categorical.Categorical,
        List[str],
    ],
    return_inverse: bool = ...,
    return_index: bool = ...,
):
    """
    usage.pandas: 22
    """
    ...


@overload
def unique(ar: List[float]):
    """
    usage.scipy: 2
    usage.sklearn: 3
    """
    ...


@overload
def unique(ar: numpy.ndarray, axis: int):
    """
    usage.scipy: 1
    """
    ...


@overload
def unique(ar: pandas.core.series.Series):
    """
    usage.dask: 3
    usage.geopandas: 1
    usage.prophet: 4
    usage.seaborn: 4
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: numpy.ma.core.MaskedArray):
    """
    usage.matplotlib: 1
    """
    ...


@overload
def unique(ar: List[int]):
    """
    usage.seaborn: 2
    usage.sklearn: 26
    """
    ...


@overload
def unique(ar: List[numpy.int64]):
    """
    usage.dask: 4
    """
    ...


@overload
def unique(ar: List[numpy.float64]):
    """
    usage.dask: 5
    usage.sklearn: 2
    """
    ...


@overload
def unique(ar: List[numpy.complex128]):
    """
    usage.dask: 3
    """
    ...


@overload
def unique(ar: List[numpy.bool_]):
    """
    usage.dask: 3
    """
    ...


@overload
def unique(ar: List[numpy.float32]):
    """
    usage.dask: 2
    """
    ...


@overload
def unique(
    ar: numpy.ndarray, return_index: bool, return_inverse: bool, return_counts: bool
):
    """
    usage.dask: 2
    """
    ...


@overload
def unique(ar: numpy.memmap):
    """
    usage.sklearn: 2
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "beer", "pizza", "the"]]):
    """
    usage.sklearn: 2
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "beer", "burger", "pizza", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "beer", "burger", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "coke", "burger", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["burger", "coke", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "celeri", "salad", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "water", "sparkling", "salad", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "celeri", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["water", "salad", "tomato", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["copyright", "water", "salad", "tomato", "the"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["three", "two", "one"]]):
    """
    usage.sklearn: 3
    """
    ...


@overload
def unique(ar: List[Union[float, int]], return_inverse: bool):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["b", "c", "a"]]):
    """
    usage.sklearn: 1
    """
    ...


@overload
def unique(ar: List[Literal["3", "2", "1"]]):
    """
    usage.sklearn: 2
    """
    ...


def unique(
    ar: object,
    return_index: bool = ...,
    return_inverse: bool = ...,
    return_counts: bool = ...,
):
    """
    usage.dask: 29
    usage.geopandas: 1
    usage.matplotlib: 8
    usage.orange3: 20
    usage.pandas: 22
    usage.prophet: 4
    usage.scipy: 66
    usage.seaborn: 11
    usage.skimage: 68
    usage.sklearn: 574
    usage.statsmodels: 72
    usage.xarray: 15

Conclusion: a large majority of usage is unique(x), without any keywords. return_inverse is fairly often used, return_index almost never (checked by searching Pandas, sklearn et al.).

So another alternative count be to include two separate functions:

  1. unique(x, /) - returns array of unique values
  2. unique_all(x, /) - returns tuple of 4 arrays (may need a better name)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API change Changes to existing functions or objects in the API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants