Commit b020891

jreback authored and TomAugspurger committed
API: categorical grouping will no longer return the cartesian product (#20583)
* BUG: groupby with categorical and other columns closes #14942
1 parent 901fc64 commit b020891

15 files changed

+748
-419
lines changed
doc/source/groupby.rst

+51-23
@@ -91,10 +91,10 @@ The mapping can be specified many different ways:
9191
- A Python function, to be called on each of the axis labels.
9292
- A list or NumPy array of the same length as the selected axis.
9393
- A dict or ``Series``, providing a ``label -> group name`` mapping.
94-
- For ``DataFrame`` objects, a string indicating a column to be used to group.
94+
- For ``DataFrame`` objects, a string indicating a column to be used to group.
9595
Of course ``df.groupby('A')`` is just syntactic sugar for
9696
``df.groupby(df['A'])``, but it makes life simpler.
97-
- For ``DataFrame`` objects, a string indicating an index level to be used to
97+
- For ``DataFrame`` objects, a string indicating an index level to be used to
9898
group.
9999
- A list of any of the above things.
100100

@@ -120,7 +120,7 @@ consider the following ``DataFrame``:
120120
'D' : np.random.randn(8)})
121121
df
122122
123-
On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
123+
On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
124124
We could naturally group by either the ``A`` or ``B`` columns, or both:
125125

126126
.. ipython:: python
@@ -360,8 +360,8 @@ Index level names may be specified as keys directly to ``groupby``.
360360
DataFrame column selection in GroupBy
361361
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
362362

363-
Once you have created the GroupBy object from a DataFrame, you might want to do
364-
something different for each of the columns. Thus, using ``[]`` similar to
363+
Once you have created the GroupBy object from a DataFrame, you might want to do
364+
something different for each of the columns. Thus, using ``[]`` similar to
365365
getting a column from a DataFrame, you can do:
366366

367367
.. ipython:: python
@@ -421,7 +421,7 @@ statement if you wish: ``for (k1, k2), group in grouped:``.
421421
Selecting a group
422422
-----------------
423423

424-
A single group can be selected using
424+
A single group can be selected using
425425
:meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:
426426

427427
.. ipython:: python
@@ -444,8 +444,8 @@ perform a computation on the grouped data. These operations are similar to the
444444
:ref:`aggregating API <basics.aggregate>`, :ref:`window functions API <stats.aggregate>`,
445445
and :ref:`resample API <timeseries.aggregate>`.
446446

447-
An obvious one is aggregation via the
448-
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
447+
An obvious one is aggregation via the
448+
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
449449
:meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method:
450450

451451
.. ipython:: python
@@ -517,12 +517,12 @@ Some common aggregating functions are tabulated below:
517517
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
518518
:meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
519519
:meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values
520-
521520

522-
The aggregating functions above will exclude NA values. Any function which
521+
522+
The aggregating functions above will exclude NA values. Any function which
523523
reduces a :class:`Series` to a scalar value is an aggregation function and will work,
524524
a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that
525-
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
525+
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
526526
filter, see :ref:`here <groupby.nth>`.
527527

528528
.. _groupby.aggregate.multifunc:
@@ -732,7 +732,7 @@ and that the transformed data contains no NAs.
732732
.. note::
733733

734734
Some functions will automatically transform the input when applied to a
735-
GroupBy object, but returning an object of the same shape as the original.
735+
GroupBy object, but returning an object of the same shape as the original.
736736
Passing ``as_index=False`` will not affect these transformation methods.
737737

738738
For example: ``fillna, ffill, bfill, shift.``.
@@ -926,7 +926,7 @@ The dimension of the returned result can also change:
926926

927927
In [11]: grouped.apply(f)
928928

929-
``apply`` on a Series can operate on a returned value from the applied function,
929+
``apply`` on a Series can operate on a returned value from the applied function,
930930
that is itself a series, and possibly upcast the result to a DataFrame:
931931

932932
.. ipython:: python
@@ -984,20 +984,48 @@ will be (silently) dropped. Thus, this does not pose any problems:
984984
985985
df.groupby('A').std()
986986
987-
Note that ``df.groupby('A').colname.std().`` is more efficient than
987+
Note that ``df.groupby('A').colname.std().`` is more efficient than
988988
``df.groupby('A').std().colname``, so if the result of an aggregation function
989-
is only interesting over one column (here ``colname``), it may be filtered
989+
is only interesting over one column (here ``colname``), it may be filtered
990990
*before* applying the aggregation function.
991991

992+
.. _groupby.observed:
993+
994+
Handling of (un)observed Categorical values
995+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
996+
997+
When using a ``Categorical`` grouper (as a single grouper or as part of multiple groupers), the ``observed`` keyword
998+
controls whether to return a cartesian product of all possible grouper values (``observed=False``) or only those
999+
values that are actually observed (``observed=True``).
1000+
1001+
Show all values:
1002+
1003+
.. ipython:: python
1004+
1005+
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
1006+
1007+
Show only the observed values:
1008+
1009+
.. ipython:: python
1010+
1011+
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()
1012+
1013+
The returned dtype of the grouped result will *always* include *all* of the categories that were grouped.
1014+
1015+
.. ipython:: python
1016+
1017+
s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
1018+
s.index.dtype
1019+
9921020
.. _groupby.missing:
9931021

9941022
NA and NaT group handling
9951023
~~~~~~~~~~~~~~~~~~~~~~~~~
9961024

997-
If there are any NaN or NaT values in the grouping key, these will be
998-
automatically excluded. In other words, there will never be an "NA group" or
999-
"NaT group". This was not the case in older versions of pandas, but users were
1000-
generally discarding the NA group anyway (and supporting it was an
1025+
If there are any NaN or NaT values in the grouping key, these will be
1026+
automatically excluded. In other words, there will never be an "NA group" or
1027+
"NaT group". This was not the case in older versions of pandas, but users were
1028+
generally discarding the NA group anyway (and supporting it was an
10011029
implementation headache).
10021030

10031031
Grouping with ordered factors
@@ -1084,8 +1112,8 @@ This shows the first or last n rows from each group.
10841112
Taking the nth row of each group
10851113
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10861114

1087-
To select from a DataFrame or Series the nth item, use
1088-
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
1115+
To select from a DataFrame or Series the nth item, use
1116+
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
10891117
will return a single row (or no row) per group if you pass an int for n:
10901118

10911119
.. ipython:: python
@@ -1153,7 +1181,7 @@ Enumerate groups
11531181
.. versionadded:: 0.20.2
11541182

11551183
To see the ordering of the groups (as opposed to the order of rows
1156-
within a group given by ``cumcount``) you can use
1184+
within a group given by ``cumcount``) you can use
11571185
:meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.
11581186

11591187

@@ -1273,7 +1301,7 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
12731301
Multi-column factorization
12741302
~~~~~~~~~~~~~~~~~~~~~~~~~~
12751303
1276-
By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
1304+
By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
12771305
information about the groups in a way similar to :func:`factorize` (as described
12781306
further in the :ref:`reshaping API <reshaping.factorize>`) but which applies
12791307
naturally to multiple columns of mixed type and different
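
As a rough sketch of the ``observed`` behaviour documented in the new "Handling of (un)observed Categorical values" section of doc/source/groupby.rst, the snippet below mirrors the Series examples from that section. It assumes a pandas version that accepts the ``observed`` keyword (0.23.0 or later); the variable names are illustrative and not part of the commit.

```python
import pandas as pd

# One observed category ('a') and one unobserved category ('b').
key = pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])
ser = pd.Series([1, 1, 1])

# observed=False keeps a row for every category, observed or not.
all_groups = ser.groupby(key, observed=False).count()

# observed=True keeps only the categories that appear in the data.
seen_groups = ser.groupby(key, observed=True).count()

print(all_groups)
print(seen_groups)
```

Passing ``observed`` explicitly, as done here, avoids relying on whichever default the installed pandas version uses.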

doc/source/whatsnew/v0.23.0.txt

+52
@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our
396396

397397
.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/
398398

399+
.. _whatsnew_0230.enhancements.categorical_grouping:
400+
401+
Categorical Groupers have gained an observed keyword
402+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
403+
404+
In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for
405+
each grouper, not just the observed values. ``.groupby()`` has gained the ``observed`` keyword to toggle this behavior. The default remains backward
406+
compatible (generate a cartesian product). (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`)
407+
408+
409+
.. ipython:: python
410+
411+
cat1 = pd.Categorical(["a", "a", "b", "b"],
412+
categories=["a", "b", "z"], ordered=True)
413+
cat2 = pd.Categorical(["c", "d", "c", "d"],
414+
categories=["c", "d", "y"], ordered=True)
415+
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
416+
df['C'] = ['foo', 'bar'] * 2
417+
df
418+
419+
To show all values, the previous behavior:
420+
421+
.. ipython:: python
422+
423+
df.groupby(['A', 'B', 'C'], observed=False).count()
424+
425+
426+
To show only observed values:
427+
428+
.. ipython:: python
429+
430+
df.groupby(['A', 'B', 'C'], observed=True).count()
431+
432+
For pivoting operations, this behavior is *already* controlled by the ``dropna`` keyword:
433+
434+
.. ipython:: python
435+
436+
cat1 = pd.Categorical(["a", "a", "b", "b"],
437+
categories=["a", "b", "z"], ordered=True)
438+
cat2 = pd.Categorical(["c", "d", "c", "d"],
439+
categories=["c", "d", "y"], ordered=True)
440+
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
441+
df
442+
443+
.. ipython:: python
444+
445+
pd.pivot_table(df, values='values', index=['A', 'B'],
446+
dropna=True)
447+
pd.pivot_table(df, values='values', index=['A', 'B'],
448+
dropna=False)
449+
450+
399451
.. _whatsnew_0230.enhancements.other:
400452

401453
Other Enhancements
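
To make the "cartesian product" wording in the whatsnew entry concrete, here is a minimal sketch that groups by the two categorical columns only and compares the number of result rows. It reuses the ``cat1``/``cat2`` data from the entry; the names ``full`` and ``seen`` are illustrative, and the string column ``C`` is left out to keep the row counts easy to check by hand.

```python
import pandas as pd

cat1 = pd.Categorical(["a", "a", "b", "b"],
                      categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
                      categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

# observed=False: one row per (A, B) category pair, 3 * 3 combinations.
full = df.groupby(["A", "B"], observed=False).count()

# observed=True: only the (A, B) pairs that occur in the data.
seen = df.groupby(["A", "B"], observed=True).count()

print(len(full), len(seen))
```

Each grouper has three categories but only two observed values, so the full result has nine rows while the observed-only result has four.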

pandas/conftest.py

+11
@@ -66,6 +66,17 @@ def ip():
6666
return InteractiveShell()
6767

6868

69+
@pytest.fixture(params=[True, False, None])
70+
def observed(request):
71+
""" pass in the observed keyword to groupby for [True, False]
72+
This indicates whether categoricals should return values for
73+
values which are not in the grouper [False / None], or only values which
74+
appear in the grouper [True]. [None] is supported for future compatibility
75+
if we decide to change the default (and would need to warn if this
76+
parameter is not passed)"""
77+
return request.param
78+
79+
6980
@pytest.fixture(params=[None, 'gzip', 'bz2', 'zip',
7081
pytest.param('xz', marks=td.skip_if_no_lzma)])
7182
def compression(request):

pandas/core/arrays/categorical.py

+29-2
@@ -647,8 +647,13 @@ def _set_categories(self, categories, fastpath=False):
647647

648648
self._dtype = new_dtype
649649

650-
def _codes_for_groupby(self, sort):
650+
def _codes_for_groupby(self, sort, observed):
651651
"""
652+
Code the categories to ensure we can groupby for categoricals.
653+
654+
If observed=True, we return a new Categorical with the observed
655+
categories only.
656+
652657
If sort=False, return a copy of self, coded with categories as
653658
returned by .unique(), followed by any categories not appearing in
654659
the data. If sort=True, return self.
@@ -661,6 +666,8 @@ def _codes_for_groupby(self, sort):
661666
----------
662667
sort : boolean
663668
The value of the sort parameter groupby was called with.
669+
observed : boolean
670+
Account only for the observed values
664671
665672
Returns
666673
-------
@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort):
671678
categories in the original order.
672679
"""
673680

681+
# we only care about observed values
682+
if observed:
683+
unique_codes = unique1d(self.codes)
684+
cat = self.copy()
685+
686+
take_codes = unique_codes[unique_codes != -1]
687+
if self.ordered:
688+
take_codes = np.sort(take_codes)
689+
690+
# we recode according to the uniques
691+
categories = self.categories.take(take_codes)
692+
codes = _recode_for_categories(self.codes,
693+
self.categories,
694+
categories)
695+
696+
# return a new categorical that maps our new codes
697+
# and categories
698+
dtype = CategoricalDtype(categories, ordered=self.ordered)
699+
return type(self)(codes, dtype=dtype, fastpath=True)
700+
674701
# Already sorted according to self.categories; all is fine
675702
if sort:
676703
return self
@@ -2161,7 +2188,7 @@ def unique(self):
21612188
# exclude nan from indexer for categories
21622189
take_codes = unique_codes[unique_codes != -1]
21632190
if self.ordered:
2164-
take_codes = sorted(take_codes)
2191+
take_codes = np.sort(take_codes)
21652192
return cat.set_categories(cat.categories.take(take_codes))
21662193

21672194
def _values_for_factorize(self):
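
The ``observed=True`` branch added to ``_codes_for_groupby`` can be approximated with public pandas API. The sketch below is not the commit's implementation: it swaps the internal ``unique1d`` and ``_recode_for_categories`` helpers for ``pd.unique`` and ``Categorical.set_categories``, and ``keep_observed_categories`` is a hypothetical name used only for illustration.

```python
import numpy as np
import pandas as pd


def keep_observed_categories(cat):
    """Hypothetical helper: drop categories that never occur in ``cat``."""
    # Unique codes in order of first appearance; -1 marks missing values.
    unique_codes = pd.unique(np.asarray(cat.codes))
    take_codes = unique_codes[unique_codes != -1]
    if cat.ordered:
        # Ordered categoricals keep their original category order.
        take_codes = np.sort(take_codes)
    # set_categories recodes the values against the reduced category list.
    return cat.set_categories(cat.categories.take(take_codes))


values = ["b", "a", None, "b"]
unordered = pd.Categorical(values, categories=["a", "b", "z"])
ordered = pd.Categorical(values, categories=["a", "b", "z"], ordered=True)

print(keep_observed_categories(unordered).categories)
print(keep_observed_categories(ordered).categories)
```

For the unordered input the surviving categories follow order of appearance (``b`` then ``a``), while the ordered input keeps ``a`` before ``b``; the unused category ``z`` is dropped in both cases.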

pandas/core/generic.py

+9-2
@@ -6599,7 +6599,7 @@ def clip_lower(self, threshold, axis=None, inplace=False):
65996599
axis=axis, inplace=inplace)
66006600

66016601
def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
6602-
group_keys=True, squeeze=False, **kwargs):
6602+
group_keys=True, squeeze=False, observed=None, **kwargs):
66036603
"""
66046604
Group series using mapper (dict or key function, apply given function
66056605
to group, return result as series) or by a series of columns.
@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
66326632
squeeze : boolean, default False
66336633
reduce the dimensionality of the return type if possible,
66346634
otherwise return a consistent type
6635+
observed : boolean, default None
6636+
if True: only show observed values for categorical groupers.
6637+
if False: show all values for categorical groupers.
6638+
if None: if any categorical groupers, show a FutureWarning,
6639+
default to False.
6640+
6641+
.. versionadded:: 0.23.0
66356642
66366643
Returns
66376644
-------
@@ -6665,7 +6672,7 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
66656672
axis = self._get_axis_number(axis)
66666673
return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
66676674
sort=sort, group_keys=group_keys, squeeze=squeeze,
6668-
**kwargs)
6675+
observed=observed, **kwargs)
66696676

66706677
def asfreq(self, freq, method=None, how=None, normalize=False,
66716678
fill_value=None):
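
The docstring added to ``groupby`` describes ``observed`` in terms of categorical groupers. A small sketch of what that implies at the call site follows, assuming a pandas version with the keyword available; the frame and column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "kind": pd.Categorical(["x", "x", "y"], categories=["x", "y", "w"]),
    "label": ["p", "q", "p"],
    "values": [1, 2, 3],
})

# Categorical grouper: the keyword changes which groups are returned.
by_cat_all = df.groupby("kind", observed=False)["values"].sum()
by_cat_seen = df.groupby("kind", observed=True)["values"].sum()

# Plain string grouper: the keyword makes no difference to the result.
by_label_all = df.groupby("label", observed=False)["values"].sum()
by_label_seen = df.groupby("label", observed=True)["values"].sum()

print(len(by_cat_all), len(by_cat_seen))
print(by_label_all.equals(by_label_seen))
```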
