
ENH: tolerance now takes list-like argument for reindex and get_indexer. #17367

Merged · 6 commits · Oct 14, 2017

Conversation

@buntwo (Contributor) commented Aug 29, 2017

Enable use of list-like values for tolerance argument in DataFrame.reindex(), Series.reindex(), Index.get_indexer().

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks commented Aug 29, 2017

Hello @buntwo! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated October 14, 2017 at 19:10 UTC

@jreback (Contributor) commented Aug 29, 2017

does this have an associated issue?

can u show an example use case

@buntwo (Contributor, author) commented Aug 29, 2017

Use case: suppose you have a time series of values for a stock, with some days of missing data in between, and you want to select a date range that includes the missing data. You want backward (or forward, or nearest) fill logic within N days when doing so, but you want N to depend on some other property of the stock that varies with time. Currently pandas only supports a constant N. This feature enables the use case by allowing the user to pass a list-like object of the same size as the target, containing a tolerance per point.
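The merged feature makes this expressible directly. A minimal sketch of the use case, with made-up prices (assuming pandas 0.21+, where this PR landed):

```python
import pandas as pd

# Made-up daily prices with gaps between observations
prices = pd.Series([10.0, 11.0, 12.0],
                   index=pd.to_datetime(['2017-01-02', '2017-01-04',
                                         '2017-01-09']))

# Forward-fill onto new dates with a *per-date* lookback window:
# 1 day for the first two targets, 3 days for the last one.
targets = pd.to_datetime(['2017-01-03', '2017-01-05', '2017-01-11'])
filled = prices.reindex(targets, method='pad',
                        tolerance=[pd.Timedelta('1D'),
                                   pd.Timedelta('1D'),
                                   pd.Timedelta('3D')])
# 2017-01-11 picks up the 2017-01-09 value (2 days back) only because
# its tolerance is 3 days; a scalar tolerance of 1 day would leave NaN.
```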

@gfyoung (Member) commented Aug 29, 2017

@buntwo: It seems like you want the tolerance to be time-varying. I suppose that could be useful, but I haven't really seen a use case like this.

I would suggest you open an issue for this and point your PR to it so we can discuss.

In the future, for enhancements like this, it would be preferable to open an issue first so that we can discuss there rather than in the PR.

@gfyoung added the 2/3 Compat and Indexing (Related to indexing on series/frames, not to indexes themselves) labels and removed the 2/3 Compat label Aug 29, 2017
@codecov bot commented Sep 1, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@36dadd7). Click here to learn what that means.
The diff coverage is 95.91%.


@@            Coverage Diff            @@
##             master   #17367   +/-   ##
=========================================
  Coverage          ?   91.02%           
=========================================
  Files             ?      162           
  Lines             ?    49603           
  Branches          ?        0           
=========================================
  Hits              ?    45150           
  Misses            ?     4453           
  Partials          ?        0
Flag Coverage Δ
#multiple 88.8% <95.91%> (?)
#single 40.22% <10.2%> (?)
Impacted Files Coverage Δ
pandas/core/generic.py 91.99% <ø> (ø)
pandas/core/indexes/datetimelike.py 96.75% <100%> (ø)
pandas/core/indexes/numeric.py 97.24% <100%> (ø)
pandas/core/indexes/datetimes.py 95.35% <100%> (ø)
pandas/core/indexes/base.py 95.89% <100%> (ø)
pandas/core/indexes/period.py 92.5% <77.77%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36dadd7...d6ec1d6.

@codecov bot commented Sep 1, 2017

Codecov Report

Merging #17367 into master will decrease coverage by 0.02%.
The diff coverage is 92.68%.


@@            Coverage Diff             @@
##           master   #17367      +/-   ##
==========================================
- Coverage   91.24%   91.21%   -0.03%     
==========================================
  Files         163      163              
  Lines       50080    50099      +19     
==========================================
+ Hits        45693    45700       +7     
- Misses       4387     4399      +12
Flag Coverage Δ
#multiple 89.02% <92.68%> (-0.01%) ⬇️
#single 40.31% <21.95%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 92.2% <ø> (ø) ⬆️
pandas/core/indexes/timedeltas.py 91.19% <100%> (ø) ⬆️
pandas/core/tools/timedeltas.py 98.38% <100%> (+0.05%) ⬆️
pandas/core/indexes/datetimelike.py 97.1% <100%> (ø) ⬆️
pandas/core/indexes/numeric.py 97.23% <100%> (+0.05%) ⬆️
pandas/core/indexes/base.py 96.42% <80%> (-0.06%) ⬇️
pandas/core/indexes/datetimes.py 95.5% <83.33%> (-0.08%) ⬇️
pandas/core/indexes/period.py 92.68% <91.66%> (-0.11%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e001500...7e7051a.

@buntwo buntwo closed this Sep 1, 2017
@buntwo buntwo reopened this Sep 1, 2017
@jreback (Contributor) left a comment

Comments are in-line. You are doing quite a bit of extra checking; I think you can pare this down substantially. You may need to adjust your tests slightly as well.

@@ -130,8 +130,7 @@ Other Enhancements
- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)


- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for `tolerance`.
Contributor:
double back-ticks on tolerance

@@ -2658,8 +2659,14 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
Contributor:
add a blank line where you have new text (I think it will render ok)

@@ -2881,8 +2888,14 @@ def _reindex_multi(self, axes, copy, fill_value):
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
Contributor:
same (bonus points if we can use a _shared_docs here to avoid code duplication)

@@ -2444,9 +2443,14 @@ def _get_unique_index(self, dropna=False):
tolerance : optional
Maximum distance from index value for inexact matches. The value of
the index at the matching location most satisfy the equation
``abs(index[loc] - key) <= tolerance``.
``abs(index[loc] - key) <= tolerance``. Tolerance may be a scalar
Contributor:
same

@@ -2566,8 +2570,14 @@ def _get_level_values(self, level):
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
Contributor:
same. actually let's fix the shared_docs things first in another PR (or here is ok too). too much doc-string duplication.

buntwo (Contributor, author) commented Sep 5, 2017:
Makes sense. How would I do that though? This (and the 3 others above) is already in a _shared_docs, so this would require another layer of abstraction above _shared_docs. Quite a bit of duplication with the existing tolerance docstrings as well.

except KeyError:
raise KeyError(key)
except ValueError as e:
# ndarray tolerance size must match target index size
Contributor:
what is this for?

buntwo (author):
A list-like tolerance size mismatch raises a ValueError, but so does something else, so I differentiate here... there's definitely a better way to do this.

except ValueError:
raise ValueError('tolerance argument for %s must be numeric: %r' %
(type(self).__name__, tolerance))
tolerance = _list_to_ndarray(tolerance)
Contributor:
see above, just use np.asarray(...), no need for the rest of this
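For illustration, `np.asarray` already normalizes all the accepted list-likes, which is why a custom conversion helper is unnecessary (the values here are made up):

```python
import numpy as np

# Lists, tuples, and ndarrays all come out as a 1-D ndarray,
# so no per-type branching or conversion helper is needed.
for tol in ([0.5, 0.2, 0.2], (0.5, 0.2, 0.2), np.array([0.5, 0.2, 0.2])):
    arr = np.asarray(tol)
    assert isinstance(arr, np.ndarray)
    assert arr.ndim == 1 and arr.size == 3
```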

offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
nanos = tslib._delta_to_nanoseconds(other)
offset_nanos = tslib._delta_to_nanoseconds(offset)
if nanos % offset_nanos == 0:
if isinstance(other, np.ndarray):
Contributor:
I think you can just do:

check = np.all(nanos % offset_nanos == 0)
which will work for scalars as well
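The suggested one-liner works because `np.all` accepts scalars as well as arrays; a quick sketch with illustrative nanosecond values:

```python
import numpy as np

offset_nanos = 1_000_000_000  # an illustrative 1-second tick, in ns

# Scalar input: the modulo check is an ordinary bool, and
# np.all of a scalar just returns it.
assert np.all(3_000_000_000 % offset_nanos == 0)

# Array input: the identical expression now checks every element.
nanos = np.array([1_000_000_000, 2_500_000_000])
assert not np.all(nanos % offset_nanos == 0)
```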

@@ -775,6 +780,10 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):

if tolerance is not None:
tolerance = self._convert_tolerance(tolerance)
if isinstance(tolerance, np.ndarray) and \
Contributor:
check should already be in _convert_tolerance (I am not sure I mentioned above)

@@ -902,6 +911,10 @@ def _get_string_slice(self, key):

def _convert_tolerance(self, tolerance):
tolerance = DatetimeIndexOpsMixin._convert_tolerance(self, tolerance)
if isinstance(tolerance, np.ndarray) \
Contributor:
same

@buntwo (Contributor, author) commented Oct 14, 2017

I think the failed test on AppVeyor has nothing to do with my change; it didn't fail on the last commit.

@@ -234,6 +234,7 @@ Other Enhancements
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names. (:issue:`14207`)
- Improved the import time of pandas by about 2.25x. (:issue:`16764`)
- :func:`read_json` and :func:`to_json` now accept a ``compression`` argument which allows them to transparently handle compressed files. (:issue:`17798`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for ``tolerance``
Contributor:
add the issue number (or PR number if no issue)

elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
if getattr(arg, 'ndim', 1) == 0:
# extract array scalar and process below
arg = arg[()]
Contributor:
use .item()
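For context, both spellings extract the scalar from a 0-d array; `.item()` is the idiomatic one and returns a plain Python scalar:

```python
import numpy as np

zero_d = np.array(5)            # a 0-dimensional "array scalar"
assert zero_d.ndim == 0
assert zero_d.item() == 5       # idiomatic: returns a Python int
assert zero_d[()] == 5          # the original arg[()] indexing also works
assert isinstance(zero_d.item(), int)
```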

actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
tolerance=[0.5, 0.2, 0.2])
tm.assert_numpy_array_equal(actual, np.array(expected,
dtype=np.intp))
Contributor:
can u use parametrize here to make the tests more readable

Contributor:
can be a separate test

buntwo (Contributor, author) commented Oct 14, 2017:
I'm a little unsure of what to parametrize here: in each for loop, I actually just copy the last test (the `actual = idx.get_indexer(...)` / `tm.assert_numpy_array_equal(...)` pair using a scalar tolerance) but change the tolerance to an array. I could add some comments to make it clearer?

Contributor:
parametrize on method,expected
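A sketch of the parametrization being suggested; the index, targets, and expected values below are illustrative stand-ins, not the PR's actual test data:

```python
import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize('method, expected', [
    ('pad', [0, -1, -1]),
    ('backfill', [-1, 1, -1]),
    ('nearest', [0, 1, -1]),
])
def test_get_indexer_listlike_tolerance(method, expected):
    # each target gets its own tolerance: 0.5, 0.2, 0.2
    idx = pd.Index([0.0, 2.0, 9.0])
    actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
                             tolerance=[0.5, 0.2, 0.2])
    np.testing.assert_array_equal(actual,
                                  np.array(expected, dtype=np.intp))
```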

@@ -783,6 +788,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):

if tolerance is not None:
tolerance = self._convert_tolerance(tolerance)
if target.size != tolerance.size and tolerance.size > 1:
Contributor:
I think I would also pass target to _convert_tolerance, so the error checks can move there.

Contributor:
You would change all the _convert_tolerance methods to pass target as well.

offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
nanos = tslib._delta_to_nanoseconds(other)
if isinstance(other, np.ndarray):
nanos = np.vectorize(tslib._delta_to_nanoseconds)(other)
Contributor:
The array version of this function is almost trivial; if you can, add it alongside the other and call it here (you just need to type the input as ndarray, I think).

Contributor:
Actually, ignore the above; this is fine here.

[[0, 2, -1], [0, -1, -1],
[-1, 2, 9]]):
# list
actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',
Contributor:
could also parametrize on the type of list-like
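Parametrizing over the container type could look like this sketch (the index and values are illustrative, not the PR's test data):

```python
import numpy as np
import pandas as pd
import pytest

# Each container type holds the same per-target tolerances.
@pytest.mark.parametrize('listlike', [list, tuple, np.array, pd.Series])
def test_get_indexer_tolerance_container(listlike):
    idx = pd.Index([0.0, 2.0, 9.0])
    tol = listlike([0.5, 0.2, 0.2])
    actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',
                             tolerance=tol)
    np.testing.assert_array_equal(actual, np.array([0, 1, -1],
                                                   dtype=np.intp))
```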

@jreback jreback added this to the 0.21.0 milestone Oct 14, 2017
@jreback (Contributor) commented Oct 14, 2017

@buntwo ok parametrized the tests, ping on green.

@jreback (Contributor) commented Oct 14, 2017

This is going to fail from some flake8 errors;

git diff master -u "*.py" | flake8 --diff

will show you them.

@jreback (Contributor) commented Oct 14, 2017

Also rebase on master (you don't need to squash, though it's easier to do so).

@buntwo buntwo force-pushed the ndarray_tolerance branch from 6535e14 to e190435 Compare October 14, 2017 16:08
@jreback (Contributor) commented Oct 14, 2017

Your rebasing is removing things. Just leave it and restore the code that was removed (in a new commit), making sure all tests pass.

@buntwo (Contributor, author) commented Oct 14, 2017

Apologies for the mistake in rebasing. I used git diff to cherry-pick the unintended changes and git apply -R to revert them.

@@ -83,8 +83,12 @@ def to_timedelta(arg, unit='ns', box=True, errors='raise'):
elif isinstance(arg, ABCIndexClass):
return _convert_listlike(arg, unit=unit, box=box,
errors=errors, name=arg.name)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
return _convert_listlike(arg, unit=unit, box=box, errors=errors)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
Contributor:
can you separate these into 2 conditions on the main if/else, IOW

elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 0:
     arg = arg.item()

# this would then be unchanged
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
     return _convert.....

@buntwo (Contributor, author) commented Oct 14, 2017

Ping

@jreback jreback merged commit 5c0b20a into pandas-dev:master Oct 14, 2017
@jreback (Contributor) commented Oct 14, 2017

thanks @buntwo

@buntwo (Contributor, author) commented Oct 14, 2017

thank you @jreback !

@buntwo buntwo deleted the ndarray_tolerance branch October 14, 2017 21:59
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
* upstream/master: (76 commits)
  CategoricalDtype construction: actually use fastpath (pandas-dev#17891)
  DEPR: Deprecate tupleize_cols in to_csv (pandas-dev#17877)
  BUG: Fix wrong column selection in drop_duplicates when duplicate column names (pandas-dev#17879)
  DOC: Adding examples to update docstring (pandas-dev#16812) (pandas-dev#17859)
  TST: Skip if no openpyxl in test_excel (pandas-dev#17883)
  TST: Catch read_html slow test warning (pandas-dev#17874)
  flake8 cleanup (pandas-dev#17873)
  TST: remove moar warnings (pandas-dev#17872)
  ENH: tolerance now takes list-like argument for reindex and get_indexer. (pandas-dev#17367)
  ERR: Raise ValueError when week is passed in to_datetime format witho… (pandas-dev#17819)
  TST: remove some deprecation warnings (pandas-dev#17870)
  Refactor index-as-string groupby tests and fix spurious warning (Bug 17383) (pandas-dev#17843)
  BUG: merging with a boolean/int categorical column (pandas-dev#17841)
  DEPR: Deprecate read_csv arguments fully (pandas-dev#17865)
  BUG: to_json - prevent various segfault conditions (GH14256) (pandas-dev#17857)
  CLN: Use pandas.core.common for None checks (pandas-dev#17816)
  BUG: set tz on DTI from fixed format HDFStore (pandas-dev#17844)
  RLS: v0.21.0rc1
  Whatsnew cleanup (pandas-dev#17858)
  DEPR: Deprecate the convert parameter completely (pandas-dev#17831)
  ...
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Labels: Enhancement, Indexing (Related to indexing on series/frames, not to indexes themselves)
6 participants