ENH: tolerance now takes list-like argument for reindex and get_indexer. #17367
Conversation
Hello @buntwo! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on October 14, 2017 at 19:10 UTC
Does this have an associated issue? Can you show an example use case?
Use case: suppose you have a timeseries of values for a stock, with some days of missing data in between, and you want to select some date range that includes the missing data. You want some fill-backward (forward, nearest) N-many-days logic when doing so, but you want N to depend on some other property of the stock that varies over time. Currently pandas only supports a constant N. This feature enables that use case by allowing the user to construct a list-like object of the same size that contains the tolerance per point and pass that in.
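A minimal sketch of the kind of call this PR enables (the prices and dates below are made up for illustration; the list-like ``tolerance`` is the new part):

import pandas as pd

prices = pd.Series(
    [100.0, 101.5, 99.8],
    index=pd.to_datetime(['2017-01-02', '2017-01-04', '2017-01-09']))
target = pd.to_datetime(['2017-01-03', '2017-01-05', '2017-01-10'])

# one tolerance per target label instead of a single scalar
result = prices.reindex(target, method='nearest',
                        tolerance=[pd.Timedelta('1D'),
                                   pd.Timedelta('2D'),
                                   pd.Timedelta('1D')])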
@buntwo: It seems like you want the tolerance to be time-varying. I suppose that could be useful, but I haven't really seen a use case like this. I would suggest you open an issue for this and point your PR to it so we can discuss. In the future, for enhancements like this, it would be preferable to open an issue first so that we can discuss there rather than in the PR.
Codecov Report
@@ Coverage Diff @@
## master #17367 +/- ##
=========================================
Coverage ? 91.02%
=========================================
Files ? 162
Lines ? 49603
Branches ? 0
=========================================
Hits ? 45150
Misses ? 4453
Partials ? 0
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #17367 +/- ##
==========================================
- Coverage 91.24% 91.21% -0.03%
==========================================
Files 163 163
Lines 50080 50099 +19
==========================================
+ Hits 45693 45700 +7
- Misses 4387 4399 +12
Continue to review full report at Codecov.
Comments in-line. You are doing quite a bit of extra checking; I think you can pare this down substantially. You may need to adjust your tests slightly as well.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -130,8 +130,7 @@ Other Enhancements
- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for `tolerance`.
Use double back-ticks on ``tolerance``.
pandas/core/generic.py
Outdated
@@ -2658,8 +2659,14 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
    Maximum distance between original and new labels for inexact
    matches. The values of the index at the matching locations most
    satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
    Tolerance may be a scalar value, which applies the same tolerance
add a blank line where you have new text (I think it will render ok)
pandas/core/generic.py
Outdated
@@ -2881,8 +2888,14 @@ def _reindex_multi(self, axes, copy, fill_value):
    Maximum distance between original and new labels for inexact
    matches. The values of the index at the matching locations most
    satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
    Tolerance may be a scalar value, which applies the same tolerance
same (bonus points if we can use a _shared_docs here to avoid code duplication)
pandas/core/indexes/base.py
Outdated
@@ -2444,9 +2443,14 @@ def _get_unique_index(self, dropna=False):
    tolerance : optional
        Maximum distance from index value for inexact matches. The value of
        the index at the matching location most satisfy the equation
        ``abs(index[loc] - key) <= tolerance``.
        ``abs(index[loc] - key) <= tolerance``. Tolerance may be a scalar
same
pandas/core/indexes/base.py
Outdated
@@ -2566,8 +2570,14 @@ def _get_level_values(self, level):
    Maximum distance between original and new labels for inexact
    matches. The values of the index at the matching locations most
    satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
    Tolerance may be a scalar value, which applies the same tolerance
same. actually let's fix the shared_docs things first in another PR (or here is ok too). too much doc-string duplication.
Makes sense. How would I do that though? This (and the 3 others above) is already in a _shared_docs, so this would require another layer of abstraction above _shared_docs. Quite a bit of duplication with the existing tolerance docstrings as well.
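For readers unfamiliar with the pattern being discussed, a rough sketch of how pandas shares doc-string fragments via a ``_shared_docs`` dict and the ``Appender`` decorator (names simplified; not the actual pandas source):

from pandas.util._decorators import Appender

_shared_docs = {}
_shared_docs['tolerance'] = """
tolerance : optional
    Maximum distance between original and new labels for inexact
    matches. May be a scalar, or a list-like of the same size as the
    target, giving a per-element tolerance."""

class MyIndex(object):

    # the decorator appends the shared fragment to each doc-string,
    # so the text is written only once
    @Appender(_shared_docs['tolerance'])
    def get_indexer(self, target, method=None, limit=None, tolerance=None):
        ...

    @Appender(_shared_docs['tolerance'])
    def get_loc(self, key, method=None, tolerance=None):
        ...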
pandas/core/indexes/datetimes.py
Outdated
except KeyError:
    raise KeyError(key)
except ValueError as e:
    # ndarray tolerance size must match target index size
what is this for?
A list-like tolerance size mismatch raises a ValueError, but so does something else, so I differentiate here... there's definitely a better way to do this.
pandas/core/indexes/numeric.py
Outdated
except ValueError:
    raise ValueError('tolerance argument for %s must be numeric: %r' %
                     (type(self).__name__, tolerance))
tolerance = _list_to_ndarray(tolerance)
See above, just use np.asarray(...); no need for the rest of this.
pandas/core/indexes/period.py
Outdated
offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
    nanos = tslib._delta_to_nanoseconds(other)
    offset_nanos = tslib._delta_to_nanoseconds(offset)
    if nanos % offset_nanos == 0:
    if isinstance(other, np.ndarray):
I think you can just do
check = np.all(nanos % offset_nanos == 0)
which will work for scalars as well.
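To illustrate why that works for both shapes (the values here are made up, not from the PR):

import numpy as np

# with an ndarray of nanosecond offsets, np.all reduces the element-wise comparison
np.all(np.array([3600, 7200]) % 3600 == 0)   # True

# with a plain scalar it simply evaluates the single comparison
np.all(5400 % 3600 == 0)                     # False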
pandas/core/indexes/period.py
Outdated
@@ -775,6 +780,10 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):

    if tolerance is not None:
        tolerance = self._convert_tolerance(tolerance)
        if isinstance(tolerance, np.ndarray) and \
That check should already be in _convert_tolerance (I am not sure if I mentioned this above).
pandas/core/indexes/period.py
Outdated
@@ -902,6 +911,10 @@ def _get_string_slice(self, key):

    def _convert_tolerance(self, tolerance):
        tolerance = DatetimeIndexOpsMixin._convert_tolerance(self, tolerance)
        if isinstance(tolerance, np.ndarray) \
same
I think the failed test on AppVeyor has nothing to do with my change; it didn't fail on the last commit.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -234,6 +234,7 @@ Other Enhancements
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names. (:issue:`14207`)
- Improved the import time of pandas by about 2.25x. (:issue:`16764`)
- :func:`read_json` and :func:`to_json` now accept a ``compression`` argument which allows them to transparently handle compressed files. (:issue:`17798`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for ``tolerance``
add the issue number (or PR number if no issue)
pandas/core/tools/timedeltas.py
Outdated
elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
    if getattr(arg, 'ndim', 1) == 0:
        # extract array scalar and process below
        arg = arg[()]
use .item()
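For context (a generic numpy illustration, not code from the PR), the difference between the two spellings:

import numpy as np

zero_d = np.asarray(3.5)   # a 0-dimensional array wrapping a single value
zero_d[()]                 # np.float64(3.5) - works, but obscure
zero_d.item()              # 3.5 as a plain Python scalar, reads more clearly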
pandas/tests/indexes/test_base.py
Outdated
actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
                         tolerance=[0.5, 0.2, 0.2])
tm.assert_numpy_array_equal(actual, np.array(expected,
                                             dtype=np.intp))
Can you use parametrize here to make the tests more readable?
can be a separate test
I'm a little unsure of what to parametrize here: in each for loop I actually just copy the last test (actual = idx.get_indexer(...), tm.assert_numpy_array_equal(...)) that uses a scalar tolerance, and change the tolerance to an array. I could add some comments to make it clearer?
parametrize on method,expected
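Roughly what that suggestion could look like; the index, targets, tolerances and expected values below are illustrative, not the PR's actual test data:

import numpy as np
import pandas as pd
import pandas.util.testing as tm
import pytest

@pytest.mark.parametrize('method,expected', [
    ('pad', [0, -1, -1]),
    ('backfill', [-1, 2, -1]),
    ('nearest', [0, 2, -1]),
])
def test_get_indexer_listlike_tolerance(method, expected):
    idx = pd.Index(np.arange(10, dtype=np.float64))
    # one tolerance per target value
    actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
                             tolerance=[0.5, 0.2, 0.2])
    tm.assert_numpy_array_equal(actual, np.array(expected, dtype=np.intp))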
pandas/core/indexes/period.py
Outdated
@@ -783,6 +788,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):

    if tolerance is not None:
        tolerance = self._convert_tolerance(tolerance)
        if target.size != tolerance.size and tolerance.size > 1:
I think I would also pass target to _convert_tolerance, so you can move the error checks there.
You would change all _convert_tolerance implementations to pass target as well.
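A hypothetical sketch of what the suggested refactor could look like: _convert_tolerance takes the target so the size check lives in one place instead of being repeated in every get_indexer implementation (names follow the PR discussion, details are illustrative):

import numpy as np

# sketch of a method on the Index class
def _convert_tolerance(self, tolerance, target):
    tolerance = np.asarray(tolerance)
    if target.size != tolerance.size and tolerance.size > 1:
        raise ValueError('list-like tolerance size must match '
                         'target index size')
    return tolerance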
offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
    nanos = tslib._delta_to_nanoseconds(other)
    if isinstance(other, np.ndarray):
        nanos = np.vectorize(tslib._delta_to_nanoseconds)(other)
The array version of this function is almost trivial; if you can, add it alongside the other and call it here (you just need to type the input as ndarray, I think).
Actually, ignore the above; this is ok here.
pandas/tests/indexes/test_base.py
Outdated
                  [[0, 2, -1], [0, -1, -1],
                   [-1, 2, 9]]):
    # list
    actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',
could also parametrize on the type of list-like
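A possible shape for that suggestion (again with made-up data; the list-like constructors to cover are just examples):

import numpy as np
import pandas as pd
import pandas.util.testing as tm
import pytest

@pytest.mark.parametrize('listlike', [list, tuple, np.asarray])
def test_get_indexer_tolerance_listlike_type(listlike):
    idx = pd.Index(np.arange(10, dtype=np.float64))
    # the same per-target tolerances, passed as different list-like types
    actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',
                             tolerance=listlike([0.5, 0.2, 0.2]))
    tm.assert_numpy_array_equal(actual, np.array([0, 2, -1], dtype=np.intp))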
@buntwo ok parametrized the tests, ping on green.
Going to fail from some flake errors.
Also rebase on master (you don't need to squash, though it's easier to do so).
Force-pushed from 6535e14 to e190435.
Your rebasing is removing things. Just leave it and restore the code that was removed (in a new commit), making sure all tests pass.
Apologies for the mistake in rebasing. I used git diff to cherry-pick the unintended changes and git apply -R to revert them.
pandas/core/tools/timedeltas.py
Outdated
@@ -83,8 +83,12 @@ def to_timedelta(arg, unit='ns', box=True, errors='raise'):
    elif isinstance(arg, ABCIndexClass):
        return _convert_listlike(arg, unit=unit, box=box,
                                 errors=errors, name=arg.name)
    elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
        return _convert_listlike(arg, unit=unit, box=box, errors=errors)
    elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
Can you separate these into 2 conditions on the main if/else, IOW:
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 0:
    arg = arg.item()

# this would then be unchanged
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
    return _convert.....
Ping
thanks @buntwo
thank you @jreback!
* upstream/master: (76 commits)
  CategoricalDtype construction: actually use fastpath (pandas-dev#17891)
  DEPR: Deprecate tupleize_cols in to_csv (pandas-dev#17877)
  BUG: Fix wrong column selection in drop_duplicates when duplicate column names (pandas-dev#17879)
  DOC: Adding examples to update docstring (pandas-dev#16812) (pandas-dev#17859)
  TST: Skip if no openpyxl in test_excel (pandas-dev#17883)
  TST: Catch read_html slow test warning (pandas-dev#17874)
  flake8 cleanup (pandas-dev#17873)
  TST: remove moar warnings (pandas-dev#17872)
  ENH: tolerance now takes list-like argument for reindex and get_indexer. (pandas-dev#17367)
  ERR: Raise ValueError when week is passed in to_datetime format witho… (pandas-dev#17819)
  TST: remove some deprecation warnings (pandas-dev#17870)
  Refactor index-as-string groupby tests and fix spurious warning (Bug 17383) (pandas-dev#17843)
  BUG: merging with a boolean/int categorical column (pandas-dev#17841)
  DEPR: Deprecate read_csv arguments fully (pandas-dev#17865)
  BUG: to_json - prevent various segfault conditions (GH14256) (pandas-dev#17857)
  CLN: Use pandas.core.common for None checks (pandas-dev#17816)
  BUG: set tz on DTI from fixed format HDFStore (pandas-dev#17844)
  RLS: v0.21.0rc1
  Whatsnew cleanup (pandas-dev#17858)
  DEPR: Deprecate the convert parameter completely (pandas-dev#17831)
  ...
Enable use of list-like values for the tolerance argument in DataFrame.reindex(), Series.reindex(), and Index.get_indexer().
git diff upstream/master -u -- "*.py" | flake8 --diff