Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: making value_counts stable/keeping original ordering #39009

Merged
merged 14 commits into from
Jan 22, 2021

Conversation

realead
Copy link
Contributor

@realead realead commented Jan 6, 2021

closes #12679
closes #11227

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

The order of the returned keys for value_counts aren't arbitrary (i.e. depending on the used hash function) but are original ordering (when sorted this applies for the keys with the same number of values).

@realead
Copy link
Contributor Author

realead commented Jan 6, 2021

There are still some issues that needs to be addressed/discussed:

  1. mode needs probably to reuse the same "stable" functionality. Right now there are differences between value_count and mode (see BUG: Inconsistent results of value_counts and mode for different nan-values #39007). If these differences are not deliberate and just a bug/oversight - that would make the life a little bit easier.
  2. value_count has now a special postprocessing step for nans, ensuring that e.g. pd.NA and np.nan are handled as the same object (
    mask = isna(values)
    if not dropna and mask.any():
    if not isna(keys).any():
    keys = np.insert(keys, 0, np.NaN)
    counts = np.insert(counts, 0, mask.sum())
    ) The downside of this approach is, that nans aren't in the "original order". Probably a better approach would be something like this:
cpdef stable_value_count_object(ndarray[object] values, bint dropna, na_representant=np.nan):
    ...
    for i in range(n):
        val = values[i]
        if checknull(val):
            val = na_representant  # uses one representant for all null-objects
        if not checknull(val) or not dropna:
            k = kh_get_{{ttype}}(table, <PyObject*>val)
            ...

this would be slightly faster than the current approach, would work also for mode out of the box if reused without additional postprocessing, and nans would keep their "original" place.

  1. non-object branch is no longer nogil (71edcca#diff-f4f1cb812ad02cac95da4070745323e8cf31dfbcc65da52744bb6721439210bbR123) Not sure it is a big issue. However, to improve it, probably Int64Vector & Co:
    cdef class {{name}}Vector:
    {{if dtype != 'int64'}}
    cdef:
    bint external_view_exists
    {{name}}VectorData *data
    ndarray ao
    {{endif}}
    should be improved. Right now there are some issues (e.g. the whole external_view_exists-business is very annoying. Better would be no longer hold the reference to the from to_array returned object, as it cannot be safely used anyway (it could be resized outside making data a dangling pointer)) which I would like not to address in this PR.

@realead
Copy link
Contributor Author

realead commented Jan 6, 2021

surprising asv_bench showed no changes when ran with:

asv continuous -f 1.05 upstream/master HEAD  -b series_methods -b period -b categoricals -b frame_methods

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jan 7, 2021
@realead realead force-pushed the stable_count_values branch 2 times, most recently from 6aa9652 to 8718976 Compare January 8, 2021 22:35
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good a few comments. pls add a whatsnew note in bug fixes reshaping section. also if this is closing multiple issues (iir also the 32-bit ordering one?) then indicate them there as well.

{{else}}
build_count_table_{{dtype}}(values, table, dropna)
result_keys = {{'U'+dtype[1::].title()}}Vector()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was just too clever, changed to {{name} here: 9a17c13

@realead realead force-pushed the stable_count_values branch from 8718976 to 0966c2b Compare January 14, 2021 21:43
@realead realead changed the title WIP: ENH: making value_counts stable/keeping original ordering ENH: making value_counts stable/keeping original ordering Jan 14, 2021
@realead realead force-pushed the stable_count_values branch from 0966c2b to e64b121 Compare January 14, 2021 21:53
@realead
Copy link
Contributor Author

realead commented Jan 14, 2021

I went on with 2. from comment #39009 (comment)

I think 1. and 3. from this comment are better as separate PRs.

@realead realead force-pushed the stable_count_values branch from e64b121 to e532854 Compare January 15, 2021 20:31
@jreback
Copy link
Contributor

jreback commented Jan 21, 2021

can you merge master.

@jreback
Copy link
Contributor

jreback commented Jan 21, 2021

@realead
Copy link
Contributor Author

realead commented Jan 22, 2021

@jreback Thanks, fixed now

@jreback jreback added this to the 1.3 milestone Jan 22, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm minor comments. ping on green

@@ -356,6 +356,7 @@ Reshaping
- Bug in :func:`join` over :class:`MultiIndex` returned wrong result, when one of both indexes had only one level (:issue:`36909`)
- :meth:`merge_asof` raises ``ValueError`` instead of cryptic ``TypeError`` in case of non-numerical merge columns (:issue:`29130`)
- Bug in :meth:`DataFrame.join` not assigning values correctly when having :class:`MultiIndex` where at least one dimension is from dtype ``Categorical`` with non-alphabetically sorted categories (:issue:`38502`)
- ``value_counts`` returns keys in original order (:issue:`12679`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:meth:`Series.value_counts`, can you add the other issue that is being closed here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done d85b1aa

is_null = checknull(val)
if is_null:
val = navalue
if not is_null or not dropna:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be elif

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how it could work with elif (but I'm always struggeling with logic). I have rearanged the if and hope it is clearer now: 230215f

@realead
Copy link
Contributor Author

realead commented Jan 22, 2021

@jreback green

@jreback jreback merged commit 7e531e3 into pandas-dev:master Jan 22, 2021
@jreback
Copy link
Contributor

jreback commented Jan 22, 2021

thanks @realead

@realead realead deleted the stable_count_values branch January 27, 2021 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
Projects
None yet
2 participants