Implement some reductions for string Series #31757
Conversation
Just as a counterpoint, do we really want to support all of these?
The value of
Agreed, summing strings is a little odd, but is it worth implementing for the sake of consistency with Series of object dtype (for which this is a valid operation)?
I don't think consistency with object dtype is a goal for the string dtype. Even for min/max I'm not sure what those mean in a lot of cases, unless the answer is to fall back to Python semantics. My concern is that this answer conflicts with the goal of creating a native string type.
Fair point about consistency not being important. I do think, though, that strings having an "order" to them is a pretty useful / natural concept (we'd probably want to allow sorting of strings, in which case we'd also want min and max).
You can sort strings, and then min/max have a rather clear meaning IMO. But it's true we certainly don't need to do this for consistency with object dtype, but because we think it is useful.
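As a standalone illustration of the point above (not code from this PR): Python already defines a total, lexicographic order on strings, so min/max have a well-defined meaning wherever sorting does.

```python
# Strings compare lexicographically, so min/max are well defined.
words = ["pear", "apple", "banana"]
print(min(words))  # 'apple'
print(max(words))  # 'pear'
```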
I would certainly add operations that work now on object types; otherwise the dtypes won't be used generally, which is not great.
More thoughts on adding
pandas/core/arrays/string_.py (Outdated)

    @@ -274,7 +274,16 @@ def astype(self, dtype, copy=True):
             return super().astype(dtype, copy)

         def _reduce(self, name, skipna=True, **kwargs):
    -        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")
    +        if name in ["min", "max", "sum"]:
    +            na_mask = isna(self)
the masking should be done inside the methods themselves, _reduce just dispatches
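A minimal sketch of the dispatch pattern being suggested, using a hypothetical stand-in class (TinyStringArray is illustrative, not the actual StringArray): `_reduce` only routes to the named method, and NA handling lives inside the method itself.

```python
class TinyStringArray:
    """Hypothetical stand-in for StringArray, for illustration only."""

    _SUPPORTED = {"min", "max"}

    def __init__(self, data):
        self._data = list(data)  # None marks a missing value here

    def min(self, skipna=True):
        # NA handling happens inside the reduction method itself
        vals = [v for v in self._data if v is not None] if skipna else self._data
        return min(vals)

    def max(self, skipna=True):
        vals = [v for v in self._data if v is not None] if skipna else self._data
        return max(vals)

    def _reduce(self, name, skipna=True, **kwargs):
        # _reduce just dispatches; no masking logic lives here
        if name in self._SUPPORTED:
            return getattr(self, name)(skipna=skipna)
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")
```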
Should we implement these methods for StringArray in that case? The NA handling for PandasArray seems to be broken for string inputs, so it might have to get handled within each method
Yes, I would say don't care about PandasArray too much (since PandasArray is not using pd.NA), and just implement the methods here on StringArray.
I think the reason the NA handling wasn't working is an apparently long-standing bug in nanops.nanminmax, which I think we can fix here: #18588. Basically we are filling NA with infinite values when taking the min or max, but this doesn't make sense for object dtypes, and an error gets raised even if skipna is True.
If we fix that by explicitly masking the missing values instead, I believe we can just use this function directly in StringArray methods.
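A sketch of the explicit-masking fix described above (the name `masked_min` and the NaN detection are assumptions for illustration, not pandas API):

```python
import numpy as np

def masked_min(values, skipna=True):
    # Instead of filling NA slots with +inf (which cannot be compared
    # to strings), drop them via a boolean mask before reducing.
    values = np.asarray(values, dtype=object)
    mask = np.array(
        [isinstance(v, float) and np.isnan(v) for v in values], dtype=bool
    )
    if skipna:
        values = values[~mask]
    elif mask.any():
        return np.nan  # an NA propagates when skipna=False
    if len(values) == 0:
        return np.nan
    return values.min()
```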
    -none_in_first_column_result = getattr(df[["A", "B"]], method)()
    -none_in_second_column_result = getattr(df[["B", "A"]], method)()
    +none_in_first_column_result = getattr(df[["A", "B"]], method)().sort_index()
    +none_in_second_column_result = getattr(df[["B", "A"]], method)().sort_index()
Previously the column with the missing value was getting dropped from the result so it only had a single row and the order didn't matter
Thanks for the update!
pandas/core/arrays/string_.py (Outdated)

    def max(self, axis=None, out=None, keepdims=False, skipna=True):
        nv.validate_max((), dict(out=out, keepdims=keepdims))
        result = nanops.nanmax(self._ndarray, axis=axis, skipna=skipna)
There should be no need to explicitly pass through the axis keyword, I think
pandas/core/nanops.py (Outdated)

    elif is_object_dtype(dtype) and values.ndim == 1 and na_mask.any():
        # Need to explicitly mask NA values for object dtypes
        if skipna:
            result = getattr(values[~na_mask], meth)(axis)
This masking could also be done in the min/max functions? (as you had before?)
Or, another option might be to add a min/max function to mask_ops.py, similarly as I am doing for sum in #30982 (but it should be simpler for min/max, as those don't need to handle the min_count).
I think a benefit of having it here is that this also fixes a bug for Series: pd.Series(["a", np.nan]).min() currently raises even though it shouldn't.
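The root cause can be shown in isolation: the inf-fill makes NumPy compare a float against a string, which Python 3 refuses (a standalone illustration, not pandas code):

```python
import numpy as np

# Simulate nanminmax's fill step: an NA slot replaced by +inf in an
# object array of strings. Reducing then compares str with float,
# which raises TypeError under Python 3.
arr = np.array(["a", np.inf], dtype=object)
try:
    arr.min()
except TypeError as exc:
    print("reduction raised:", exc)
```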
Ah, that's a good point. Can you add a test for that, then?
Now, that aside, I think longer term we still want the separate min/max in mask_ops.py, so it can also be used for the int dtypes. But that can then certainly be done for a separate PR.
"min", | ||
"max", | ||
]: | ||
pytest.skip("These reductions are implemented") |
Can you see if you can rather update this in test_string.py? It might be that we now need to subclass the ReduceTests instead of NoReduceTests. (Ideally the base tests remain dtype-agnostic.)
By updating in test_string.py, do you mean adding tests using the fixtures data and all_numeric_reductions, only checking for the "correct" output (and skipping over those reductions that aren't yet implemented)?
Hmm, actually looking at the base reduction tests now: they are not really written in a way that they will pass for strings.
But so you can copy this test to tests/extension/test_strings.py (and so override the base one), and then do the string-array-specific adaptation there. It gives some duplication of the test code, but it's not long, and it gives a clearer separation of concerns (the changes for string array are in test_string).
OK, so we can remove the special cases for StringArray in BaseNoReduceTests without getting test failures, as long as they're handled in TestNoReduce in test_string.py? I'm not too familiar with how these particular tests actually get executed during CI.
doc/source/whatsnew/v1.0.2.rst (Outdated)

    @@ -28,6 +28,11 @@ Fixed regressions
    Bug fixes
    ~~~~~~~~~

    **ExtensionArray**

    - Fixed issue where taking the minimum or maximum of a ``StringArray`` or ``Series`` with ``StringDtype`` type would raise. (:issue:`31746`)
Say .min() or .max()
pandas/core/nanops.py (Outdated)

    @@ -854,6 +854,8 @@ def reduction(
        mask: Optional[np.ndarray] = None,
    ) -> Dtype:

        na_mask = isna(values)
you should already have the mask (pass it in when you call this).
pandas/core/nanops.py (Outdated)

    @@ -864,6 +866,12 @@ def reduction(
            result.fill(np.nan)
        except (AttributeError, TypeError, ValueError):
            result = np.nan
        elif is_object_dtype(dtype) and values.ndim == 1 and na_mask.any():
do you have a test case that fails on non ndim==1?
Yes, I was getting a couple of test failures otherwise, I think for reductions when the entire DataFrame has object dtype (I can't recall which tests exactly). I figured the subsetting values[~mask] is only going to make sense if values has one dimension.
"min", | ||
"max", | ||
]: | ||
pytest.skip("These reductions are implemented") |
Can you add here a comment saying that those are tested in tests/arrays/test_string.py ?
Co-Authored-By: Joris Van den Bossche <jorisvandenbossche@gmail.com>
more comments
    @@ -228,7 +228,9 @@ def _maybe_get_mask(
            # Boolean data cannot contain nulls, so signal via mask being None
            return None

    -    if skipna:
    +    if skipna or is_object_dtype(values.dtype):
    +        # The masking in nanminmax does not work for object dtype, so always
Rather than do this, what exactly is the issue? "Does not work" is not very descriptive, and generally we don't put comments like this; we simply fix it.
So what nanops.nanminmax appears to do when taking the min or max in the presence of missing values is fill them with an appropriate infinite number, which has the effect of ignoring those missing values (if we're taking the minimum, replace with positive infinity; if we're taking the maximum, replace with negative infinity). The problem is that this makes no sense for strings (there is, as far as I know, no "infinite string"), and that's why we get the error about comparing floats (infinity) and strings. The easiest workaround seems to be to mask them out instead.
To make things more complicated, the _get_values function in nanminmax apparently doesn't bother to calculate a missing-value mask when skipna is False, because it's relying on the trick above working. Since it won't, I'm making sure that we always get a mask for object dtypes.
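A sketch of the guard being described (the helper name `maybe_get_mask` and the NaN detection are assumptions; the real `_maybe_get_mask` lives in pandas/core/nanops.py):

```python
import numpy as np

def maybe_get_mask(values, skipna):
    # Always compute the NA mask for object dtype, even when
    # skipna=False, because the inf-filling trick that normally makes
    # the mask unnecessary cannot work for strings.
    if skipna or values.dtype == object:
        return np.array(
            [isinstance(v, float) and np.isnan(v) for v in values.ravel()],
            dtype=bool,
        ).reshape(values.shape)
    return None  # numeric path can rely on inf-filling instead
```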
Let's actually fix this properly.
This is going to need a branch for object dtypes, and perf tests.
Similarly as #30982, I would personally rather have a separate implementation for one using masks instead of the filling logic of the nanops (with sharing helper functions where appropriate), instead of trying to fit it into the existing nanops code (which gives those special cases / complex if checks like the one below)
@jorisvandenbossche Fair point for the string dtype, although I think some kind of logic like this would be necessary to fix min / max for object strings.
Edit: Actually, it looks like maybe you're already addressing this in your PR.
some kind of logic like this would be necessary to fix min / max for object strings.
Yep, indeed, that's what you mentioned before as an argument for doing it in nanops, so the non-extension-array object dtype would benefit as well. For this, we could also add a check for object dtype, and then calculate the mask and use the masked-op implementation.
Do you think it might also be worth trying to refactor nanminmax in nanops to use masking in general instead of the current filling approach (from what I could tell this was only really needed for arrays with more than one dimension)?
yes
        and mask is not None
        and mask.any()
    ):
        # Need to explicitly mask NA values for object dtypes
why?
    @@ -865,6 +867,17 @@ def reduction(
            result.fill(np.nan)
        except (AttributeError, TypeError, ValueError):
            result = np.nan
        elif (
            is_object_dtype(dtype)
Why do you need all of these conditions? This is complicated.
Only looking at objects is for the reason above. values.ndim == 1 is because I think we can get here even if values is not vector-valued, which is more or less what we're assuming when we just mask them out (if we don't include this we can get test failures when we have an all-object DataFrame). mask.any() is because this function already works if no values are missing, so there's no reason to do masking. And mask is not None is to please mypy (we've already guaranteed that mask isn't None for objects above, but mypy doesn't know this).
I can try to be a bit more explicit in the comments if that would be helpful.
Moving this off 1.0.2, as this has raised some non-trivial issues to solve. If you want a bug fix out of this, OK, but then this needs to decouple the issues.
    @@ -59,6 +59,10 @@ Previously indexing with a nullable Boolean array containing ``NA`` would raise
    Bug fixes
    ~~~~~~~~~

    **ExtensionArray**
This is likely too invasive for 1.0.2; move to 1.1.
Going to close this for now, since it looks to be quite a bit larger than I originally thought (it seems like fixing nanmin / nanmax might be the right approach). I'm not sure how best to implement this, but another idea might be to fill with an arbitrary non-NaN value from each array along the given axis before taking the min or max; then I think it should work for any dtype (although it might be a little slower).
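The closing idea can be sketched in one dimension as follows (the helper name `fill_then_min` and the NaN detection are assumptions for illustration, not pandas internals):

```python
import numpy as np

def fill_then_min(values):
    # Replace every NA with an arbitrary non-NA value from the same
    # array; since that value already participates in the reduction,
    # the fill can never change the result, and it works for any dtype.
    values = np.asarray(values, dtype=object)
    mask = np.array(
        [isinstance(v, float) and np.isnan(v) for v in values], dtype=bool
    )
    if mask.all():
        return np.nan  # nothing but NA values
    fill = values[~mask][0]
    return np.where(mask, fill, values).min()
```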
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff