Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Move any/all to NDFrame, support additional arguments for Series. GH8302 #8550

Merged
merged 1 commit into from
Nov 11, 2014

Conversation

staple
Copy link
Contributor

@staple staple commented Oct 13, 2014

closes #8302

  • Move any/all implementations from DataFrame to NDFrame, adding support for Series
  • Special case single dimension case in NDFrame, to handle numpy.any/all specific arguments
  • Add assorted NotImplementedErrors in cases where arguments are ignored (including in preexisting _reduce functions)
  • Move the other any/all implementations from IndexOpsMixin to Index, outside of Series’ inheritance tree

@staple staple force-pushed the GH8302 branch 2 times, most recently from 5bfd86c to bf46795 Compare October 15, 2014 18:42
@jreback jreback added API Design Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 20, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 20, 2014
@jreback jreback removed the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Oct 20, 2014
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a """ + name + """
bool_only : boolean, default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These logical functions have slightly different documentation: a 'bool_only' field instead of a 'numeric_only' field, and potentially a message about supporting additional numpy arguments via the bool_extended_args variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary (nor is any other 'bool_extended_args' variable), its just not needed. That said If you have a use case pls show it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to better understand your comment. I mentioned two special cases in the documentation here, the bool_only argument name and the documentation message about supporting additional numpy arguments, which are currently supported in master's Series.any/all, which forwards to numpy's any/all.

  1. bool_only is the preexisting argument name for DataFrame's any and all, in master. Are you suggesting I should change it to numeric_only?
  2. You had suggested earlier that I should continue to support the ndarray.any/all arguments that are supported by Series.any/all in master (these are the 'out' and 'keepdims' parameters). Are you saying that I should not support these, or that I should not document that they are supported?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. maybe I wasn't clear. The signature should be something like: any(laxis, skipna, level, **kwargs); we accept but don't deal explicity with the numpy args (e.g. out) and such. Its not useful, nor is it consistent with how pandas works.

  2. I realized I was confusing about bool_only, yes that is a good idea (just not bool_extended_args

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, got it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't bool_only=True be the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made bool_only default to None (False) here because:

  • The default is None in DataFrame’s current implementation of any/all in master
  • The corresponding numeric_only argument to sum, mean, etc defaults to None
  • bool_only is not currently implemented in Series._reduce or Panel._reduce. Once we add not implemented errors, they would be thrown in the case of all default arguments, were bool_only to be True by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, though I find its a little odd (e.g. I can't think of a reason to use any/all on non-boolean data), but I am sure people will try it!.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, also, if they're calling any/all on Series in master, they are using the numpy version which does not offer a bool_only option.

Would it make sense for me to create a new issue for implementing bool_only (numeric_only) in Series an Panel, and look into whether it should be enabled by default for any/all?

@staple
Copy link
Contributor Author

staple commented Oct 21, 2014

Hi, here is the separate PR for the _reduce changes: #8592

@staple
Copy link
Contributor Author

staple commented Oct 21, 2014

Ok, I think I've addressed the pending items.

Some tests are currently disabled pending merge of the _reduce changes I separated into the other PR.

@jreback
Copy link
Contributor

jreback commented Oct 27, 2014

pls add a release note v0.15.1 (put in API section for now, though that's getting reorganized). This is not really an API change, more of an internal change.

pls rebase/squash and I'll have another look.

@staple
Copy link
Contributor Author

staple commented Oct 27, 2014

Sure, there are some minor dependencies on the _reduce PR we've also been discussing (some tests are disabled). Since it sounds like we're closed to finishing up that PR, I'll plan to rebase/squash this one once the other is in master.

@jreback
Copy link
Contributor

jreback commented Oct 27, 2014

@staple ideally I'd like to see a SINGLE any/all that goes in NDFrame AND Index (so we have argument consistency). This is tricky of course. because you have multi-dim objects. I think this CAN be done though.

@staple
Copy link
Contributor Author

staple commented Oct 29, 2014

Ok, updated. Go ahead and take a look.


Parameters
----------
axis : None or int or tuple of ints, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index are only 1-dim by definition, so can remove this parm.

@staple
Copy link
Contributor Author

staple commented Nov 2, 2014

Ok, I made the requested changes.


.. ipython:: python

p = pd.Panel(np.random.rand(2, 5, 4) > 0.1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this example here? (the 2nd one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panel now supports any and all. Panel didn't implement these functions before, so I added an example of how to call them. Do you want me to remove the example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, say that then. (I didn't know that!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already in the description:
Panel now supports the all and any aggregation functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh ok, (I think that line is too long. Better to move that part to right before the example. Follows better on the eye.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

assert_series_equal(s.any(level=0), Series([False, True, True]))

# bool_only is not implemented with level option.
self.assertRaises(NotImplementedError, s.any, bool_only=True, level=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I don't buy either one of these. Too special case. They should just work (if its a bool dtype then it will work, otherwise it should fail)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you be more specific about which test cases your are referring to? Are you asking me to remove the tests for the Series behavior where there nan values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, the level and bool_only raises NotImplementedError. was this from somewhere else? I think that should work, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, do you mean the bool_only cases? Do you mean I should remove the bool_only argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, bool_only uses _reduce's numeric_only which was never implemented for Series or Panel. (Recall recent PR where I added exceptions in those cases instead of ignoring the argument.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think _agg_by_level is one issue (currently numeric_only isn't passed). But also Series._reduce just doesn't implement numeric_only in what I think is the standard case:

        if numeric_only:
            raise NotImplementedError(
                'Series.{0} does not implement numeric_only.'.format(name))

Is your suggestion that we should add support for these? That could be a separate PR (which would probably be merged w/ master before this one).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for example (note use of sum not any):

>>> pd.Series([1]).sum(numeric_only=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/generic.py", line 3938, in stat_func
    skipna=skipna, numeric_only=numeric_only)
  File "pandas/core/series.py", line 2074, in _reduce
    'Series.{0} does not implement numeric_only.'.format(name))
NotImplementedError: Series.sum does not implement numeric_only.

Are you suggesting to fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, are you saying make sure bool_only is passed through but don't implement numeric_only in Series._reduce (so that in theory the combination of bool_only and level is no longer a special case)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok, let's leave as its, but create a new issue to discuss / fix this. In that issue show the examples where it is not-implemented ATM (so the above, and the 2 in your tests).
We'll see what we can do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though I think you should pass thru the bool_only arg now (and thus eliminate 1 case).

I like to eliminate all special cases if they don't actually make sense (we can look at this in detail later in the new issue).

@jreback
Copy link
Contributor

jreback commented Nov 2, 2014

@staple FYI this PR is really good. I am just being nit-picky! on some details.

@jreback jreback modified the milestones: 0.15.1, 0.15.2 Nov 2, 2014
@staple
Copy link
Contributor Author

staple commented Nov 2, 2014

@jreback, sure no worries!

@jreback
Copy link
Contributor

jreback commented Nov 5, 2014

can you fixup and rebase as discussed. e.g. remove args,*kwargs to Index funcs (and remove keep_dims tests). change test_logical_compat to test the functions are actually disabled on the various Indexes. thxs

@jreback jreback modified the milestones: 0.15.1, 0.15.2 Nov 6, 2014
@staple staple force-pushed the GH8302 branch 2 times, most recently from 65011e4 to f37e2d8 Compare November 7, 2014 05:52
@staple
Copy link
Contributor Author

staple commented Nov 7, 2014

Ok, see new commit.

@@ -211,6 +211,19 @@ API changes
in plotted integer values. To keep the previous behaviour, you can do
``del matplotlib.units.registry[np.datetime64]`` (:issue:`8614`).

- ``Series.all`` and ``Series.any`` now support the ``level`` and ``skipna`` parameters. ``Series.all``, ``Series.any``, ``Index.all``, and ``Index.any`` no longer support the ``out`` and ``keepdims`` parameters, which existed for compatibility with ndarray. Various index types no longer support the ``all`` and ``any`` aggregation functions. (:issue:`8302`):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move to v0.15.2.txt

@staple staple force-pushed the GH8302 branch 3 times, most recently from 992a882 to 518a62c Compare November 10, 2014 16:46
@staple
Copy link
Contributor Author

staple commented Nov 10, 2014

Ok

@jreback jreback merged commit 518a62c into pandas-dev:master Nov 11, 2014
@jreback
Copy link
Contributor

jreback commented Nov 11, 2014

@staple thanks! This was a bigger rabbit hole, but you got it done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: add level kwarg for Series.any/.all
2 participants