-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add Series & DataFrame .agg/.aggregate #14668
Conversation
@jreback Cool! Some quick feedback based on your examples above:
|
Codecov Report
@@ Coverage Diff @@
## master #14668 +/- ##
==========================================
- Coverage 91.11% 91.02% -0.09%
==========================================
Files 145 145
Lines 50332 50391 +59
==========================================
+ Hits 45858 45869 +11
- Misses 4474 4522 +48
Continue to review full report at Codecov.
|
so typically in groupby/rolling/resample we don't do transform operations, unless we explicity want them (e.g. by using With a series/dataframe OTOH, I think this is fairly common. I may actually have to do the computations before can determine this though (and raise an appropriate error). |
update the top section a bit. Not sure what I should do in cases like this. could skip, or maybe raise better message.
|
we could add a |
pandas/core/generic.py
Outdated
.. versionadded:: 0.19.2 | ||
|
||
Parameters | ||
---------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
havent' really edited these yet
6a59922
to
92eb36c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bunch of comments
doc/source/basics.rst
Outdated
|
||
.. versionadded:: 0.20.0 | ||
|
||
The aggregation APi allows one to express possibly multiple aggregation operations in a single concise way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
APi -> API
|
||
tsdf.agg(np.sum) | ||
|
||
tsdf.agg('sum') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe show here that this is the same as tsdf.sum()
?
doc/source/basics.rst
Outdated
|
||
.. ipython:: python | ||
|
||
tsdf.A.agg({'foo' : ['sum', 'mean']}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit strange. IMO the more logical example would be
In [11]: tsdf.A.agg({'foo' : 'sum', 'bar':'mean'})
Out[11]:
foo -2.019230
bar -0.336538
Name: A, dtype: float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I added that too (conceptually both are useful)
doc/source/basics.rst
Outdated
|
||
.. ipython:: python | ||
|
||
tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is that this would give a MultiIndexed series instead of a DataFrame:
So now with PR:
In [12]: tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]})
Out[12]:
foo bar
<lambda> NaN -1.019230
max NaN 1.118963
mean -0.336538 NaN
min NaN -1.476450
sum -2.019230 NaN
But could also give something like:
In [18]: tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]}).stack().swaplevel(0,1).sort_index()
Out[18]:
foo mean -0.336538
sum -2.019230
bar <lambda> -1.019230
max 1.118963
min -1.476450
dtype: float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I was just trying to produce frames, but maybe this is more useful. let me see what I can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chaning this fails a couple of the groupby tests, so going to leave it for now. I think the logic is simpler anyhow.
doc/source/basics.rst
Outdated
|
||
.. versionadded:: 0.20.0 | ||
|
||
The ``transform`` method returns an object that is indexed the same (same size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transform -> :method:`~DataFrame.transform`
pandas/core/generic.py
Outdated
- function | ||
- list of functions | ||
- dict of columns -> functions | ||
- nested dict of names -> dicts of functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is meant with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to remove, i copied from somewhere :>
result = self.agg(func, *args, **kwargs) | ||
if is_scalar(result) or len(result) != len(self): | ||
raise ValueError("transforms cannot produce " | ||
"aggregated results") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the aggregated result should be broadcasted to the full DataFrame? (as is done for groupby.transform)
except TypeError: | ||
pass | ||
if result is None: | ||
return self.apply(func, axis=axis, args=args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said before, I don't think we should do this (allow non-aggregating results in agg
, I suppose that is the meaning of result=None?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see that this is also for aggregating lambda's .. and not only for non-aggregating functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it doesn't broadcast things
try: | ||
result = self.apply(func, *args, **kwargs) | ||
except (ValueError, AttributeError, TypeError): | ||
result = func(self, *args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example of where apply fails and this is needed? (i.e. is this covered in the test cases?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, something like this (its because Series will row-by-row application)
In [4]: Series(range(5)).apply(lambda x: x-x.min())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-e65bceda88a4> in <module>()
----> 1 Series(range(5)).apply(lambda x: x-x.min())
/Users/jreback/miniconda3/envs/pandas/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
2290 else:
2291 values = self.asobject
-> 2292 mapped = lib.map_infer(values, f, convert=convert_dtype)
2293
2294 if len(mapped) and isinstance(mapped[0], Series):
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66116)()
<ipython-input-4-e65bceda88a4> in <lambda>(x)
----> 1 Series(range(5)).apply(lambda x: x-x.min())
AttributeError: 'int' object has no attribute 'min'
|
||
result = self.frame.apply(np.sqrt) | ||
assert_frame_equal(result, expected) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add here also frame.transform('sqrt')
?
d3bc362
to
345dce8
Compare
@jorisvandenbossche any more comments? this is for 0.20 so will be in master for a bit. |
ecc1339
to
e58ead6
Compare
any remaining comments? @TomAugspurger @wesm @sinhrks @jorisvandenbossche all of your points were addressed. This may not be perfect and need tweaking, but I think banging on it in master for a while is useful. |
to fully illustrate the open points (on setup
|
73432fa
to
be3c065
Compare
doc/source/basics.rst
Outdated
|
||
.. ipython:: python | ||
|
||
tsdf.A.agg({'foo' : 'sum', 'bar': 'mean'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space after 'foo'
so since haven't had much conversation on this. I think we should simply merging as is. If there is objection to the current Certainly open to changing / fixing this. But need a concrete proposal. I don't see any hard in exactly replicating a current, well-defined API (even if there are some issues around the edges). Otherwise forward progress is lost. |
Sorry for my radio silence on this. Since there have been many opinions expressed let me dig in to the questions raised and try to weigh in with an additional perspective. |
3f0a869
to
f08371b
Compare
mini-example
Current
Alternate
|
41636b0
to
83f0499
Compare
function application that mimics the groupby(..).agg/.aggregate interface .apply is now a synonym for .agg, and will accept dict/list-likes for aggregations CLN: rename .name attr -> ._selection_name from SeriesGroupby for compat (didn't exist on DataFrameGroupBy) resolves conflicts w.r.t. setting .name on a groupby object closes pandas-dev#1623 closes pandas-dev#14464 custom .describe closes pandas-dev#14483 closes pandas-dev#15015 closes pandas-dev#7014
additional doc updates
merging. |
After all these years here I am... I was doing a dashboard for my job, and I wanted to show some statistical values for the revenues data frame using streamlit. But to add a sum line on the df.describe() I had to do it manually. So in my opinion, there should have at least a parameter do enable it, cause I know in most cases we don't want to use sum. But letting the user have an option would be great in my opinion. |
interface
.apply
is now a synonym for.agg/.aggregate
, and will accept dict/list-likesfor aggregations
Series.agg(['min', 'sqrt'])
sqrt
,log
)closes #1623
closes #14464
custom .describe. I included these issues because its is quite easy to now do custom .describe.
closes #14483
closes #7014
TODO:
Series.agg({'foo' : ['min', 'max']})
Series:
DataFrame
Not sure what I should do in cases like this. could skip, or maybe raise better message.