
ENH: tolerance now takes list-like argument for reindex and get_indexer. #17367

Merged · 6 commits · Oct 14, 2017

Conversation

@buntwo (Contributor) commented Aug 29, 2017

Enable use of list-like values for tolerance argument in DataFrame.reindex(), Series.reindex(), Index.get_indexer().

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks commented Aug 29, 2017

Hello @buntwo! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated October 14, 2017 at 19:10 UTC

@jreback (Contributor) commented Aug 29, 2017

does this have an associated issue?

can u show an example use case

@buntwo (Contributor, author) commented Aug 29, 2017

Use case: suppose you have a time series of values for a stock, with some days of missing data in between, and you want to select a date range that includes the missing data. You want backward (or forward, or nearest) fill logic within N days when doing so, but you want N to depend on some other property of the stock that varies with time. Currently pandas only supports a constant N. This feature enables the use case by allowing the user to pass a list-like object of the same size as the target, containing a tolerance per point.
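The merged feature makes this expressible directly. A minimal sketch of the use case, with made-up prices (assuming pandas 0.21+, where this PR landed):

```python
import pandas as pd

# Made-up daily prices with gaps between observations
prices = pd.Series([10.0, 11.0, 12.0],
                   index=pd.to_datetime(['2017-01-02', '2017-01-04',
                                         '2017-01-09']))

# Forward-fill onto new dates with a *per-date* lookback window:
# 1 day for the first two targets, 3 days for the last one.
targets = pd.to_datetime(['2017-01-03', '2017-01-05', '2017-01-11'])
filled = prices.reindex(targets, method='pad',
                        tolerance=[pd.Timedelta('1D'),
                                   pd.Timedelta('1D'),
                                   pd.Timedelta('3D')])
# 2017-01-11 picks up the 2017-01-09 value (2 days back) only because
# its tolerance is 3 days; a scalar tolerance of 1 day would leave NaN.
```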

@gfyoung (Member) commented Aug 29, 2017

@buntwo: It seems like you want the tolerance to be time-varying. I suppose that could be useful, but I haven't really seen a use case like this.

I would suggest you open an issue for this and point your PR to it so we can discuss.

In the future, for enhancements like this, it would be preferable to open an issue first so that we can discuss there rather than in the PR.

@gfyoung added the 2/3 Compat and Indexing (Related to indexing on series/frames, not to indexes themselves) labels and removed the 2/3 Compat label Aug 29, 2017
@codecov bot commented Sep 1, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@36dadd7). Click here to learn what that means.
The diff coverage is 95.91%.


@@            Coverage Diff            @@
##             master   #17367   +/-   ##
=========================================
  Coverage          ?   91.02%           
=========================================
  Files             ?      162           
  Lines             ?    49603           
  Branches          ?        0           
=========================================
  Hits              ?    45150           
  Misses            ?     4453           
  Partials          ?        0
Flag Coverage Δ
#multiple 88.8% <95.91%> (?)
#single 40.22% <10.2%> (?)
Impacted Files Coverage Δ
pandas/core/generic.py 91.99% <ø> (ø)
pandas/core/indexes/datetimelike.py 96.75% <100%> (ø)
pandas/core/indexes/numeric.py 97.24% <100%> (ø)
pandas/core/indexes/datetimes.py 95.35% <100%> (ø)
pandas/core/indexes/base.py 95.89% <100%> (ø)
pandas/core/indexes/period.py 92.5% <77.77%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36dadd7...d6ec1d6.

@codecov bot commented Sep 1, 2017

Codecov Report

Merging #17367 into master will decrease coverage by 0.02%.
The diff coverage is 92.68%.


@@            Coverage Diff             @@
##           master   #17367      +/-   ##
==========================================
- Coverage   91.24%   91.21%   -0.03%     
==========================================
  Files         163      163              
  Lines       50080    50099      +19     
==========================================
+ Hits        45693    45700       +7     
- Misses       4387     4399      +12
Flag Coverage Δ
#multiple 89.02% <92.68%> (-0.01%) ⬇️
#single 40.31% <21.95%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 92.2% <ø> (ø) ⬆️
pandas/core/indexes/timedeltas.py 91.19% <100%> (ø) ⬆️
pandas/core/tools/timedeltas.py 98.38% <100%> (+0.05%) ⬆️
pandas/core/indexes/datetimelike.py 97.1% <100%> (ø) ⬆️
pandas/core/indexes/numeric.py 97.23% <100%> (+0.05%) ⬆️
pandas/core/indexes/base.py 96.42% <80%> (-0.06%) ⬇️
pandas/core/indexes/datetimes.py 95.5% <83.33%> (-0.08%) ⬇️
pandas/core/indexes/period.py 92.68% <91.66%> (-0.11%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e001500...7e7051a.

@buntwo buntwo closed this Sep 1, 2017
@buntwo buntwo reopened this Sep 1, 2017
@jreback (Contributor) left a comment

Comments are in-line. You are doing quite a bit of extra checking; I think you can pare this down substantially. You may need to adjust your tests slightly as well.

@@ -130,8 +130,7 @@ Other Enhancements
- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)


- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for `tolerance`.
Contributor:
double back-ticks on tolerance

@@ -2658,8 +2659,14 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
Contributor:
add a blank line where you have new text (I think it will render ok)

@@ -2881,8 +2888,14 @@ def _reindex_multi(self, axes, copy, fill_value):
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
Contributor:
same (bonus points if we can use a _shared_docs here to avoid code duplication)

@@ -2444,9 +2443,14 @@ def _get_unique_index(self, dropna=False):
tolerance : optional
Maximum distance from index value for inexact matches. The value of
the index at the matching location most satisfy the equation
``abs(index[loc] - key) <= tolerance``.
``abs(index[loc] - key) <= tolerance``. Tolerance may be a scalar
Contributor:
same

@@ -2566,8 +2570,14 @@ def _get_level_values(self, level):
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
Contributor:
same. actually let's fix the shared_docs things first in another PR (or here is ok too). too much doc-string duplication.

buntwo (Contributor, author) commented Sep 5, 2017:
Makes sense. How would I do that though? This (and the 3 others above) is already in a _shared_docs, so this would require another layer of abstraction above _shared_docs. Quite a bit of duplication with the existing tolerance docstrings as well.

except KeyError:
raise KeyError(key)
except ValueError as e:
# ndarray tolerance size must match target index size
Contributor:
what is this for?

buntwo (author):
A list-like tolerance size mismatch raises a ValueError, but so does something else, so I differentiate here... there's definitely a better way to do this.

except ValueError:
raise ValueError('tolerance argument for %s must be numeric: %r' %
(type(self).__name__, tolerance))
tolerance = _list_to_ndarray(tolerance)
Contributor:
see above, just use np.asarray(...), no need for the rest of this
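For illustration, `np.asarray` already normalizes all the accepted list-likes, which is why a custom conversion helper is unnecessary (the values here are made up):

```python
import numpy as np

# Lists, tuples, and ndarrays all come out as a 1-D ndarray,
# so no per-type branching or conversion helper is needed.
for tol in ([0.5, 0.2, 0.2], (0.5, 0.2, 0.2), np.array([0.5, 0.2, 0.2])):
    arr = np.asarray(tol)
    assert isinstance(arr, np.ndarray)
    assert arr.ndim == 1 and arr.size == 3
```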

offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
nanos = tslib._delta_to_nanoseconds(other)
offset_nanos = tslib._delta_to_nanoseconds(offset)
if nanos % offset_nanos == 0:
if isinstance(other, np.ndarray):
Contributor:
I think you can just do:

check = np.all(nanos % offset_nanos == 0)
which will work for scalars as well
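The suggested one-liner works because `np.all` accepts scalars as well as arrays; a quick sketch with illustrative nanosecond values:

```python
import numpy as np

offset_nanos = 1_000_000_000  # an illustrative 1-second tick, in ns

# Scalar input: the modulo check is an ordinary bool, and
# np.all of a scalar just returns it.
assert np.all(3_000_000_000 % offset_nanos == 0)

# Array input: the identical expression now checks every element.
nanos = np.array([1_000_000_000, 2_500_000_000])
assert not np.all(nanos % offset_nanos == 0)
```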

@@ -775,6 +780,10 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):

if tolerance is not None:
tolerance = self._convert_tolerance(tolerance)
if isinstance(tolerance, np.ndarray) and \
Contributor:
check should already be in _convert_tolerance (I am not sure I mentioned above)

@@ -902,6 +911,10 @@ def _get_string_slice(self, key):

def _convert_tolerance(self, tolerance):
tolerance = DatetimeIndexOpsMixin._convert_tolerance(self, tolerance)
if isinstance(tolerance, np.ndarray) \
Contributor:
same

@buntwo (Contributor, author) commented Oct 14, 2017

I think the failed test on AppVeyor has nothing to do with my change; it didn't fail on the last commit.

@@ -234,6 +234,7 @@ Other Enhancements
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names. (:issue:`14207`)
- Improved the import time of pandas by about 2.25x. (:issue:`16764`)
- :func:`read_json` and :func:`to_json` now accept a ``compression`` argument which allows them to transparently handle compressed files. (:issue:`17798`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for ``tolerance``
Contributor:
add the issue number (or PR number if no issue)

elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
if getattr(arg, 'ndim', 1) == 0:
# extract array scalar and process below
arg = arg[()]
Contributor:
use .item()
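For context, both spellings extract the scalar from a 0-d array; `.item()` is the idiomatic one and returns a plain Python scalar:

```python
import numpy as np

zero_d = np.array(5)            # a 0-dimensional "array scalar"
assert zero_d.ndim == 0
assert zero_d.item() == 5       # idiomatic: returns a Python int
assert zero_d[()] == 5          # the original arg[()] indexing also works
assert isinstance(zero_d.item(), int)
```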

actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
tolerance=[0.5, 0.2, 0.2])
tm.assert_numpy_array_equal(actual, np.array(expected,
dtype=np.intp))
Contributor:
can u use parametrize here to make the tests more readable

Contributor:
can be a separate test

buntwo (Contributor, author) commented Oct 14, 2017:
I'm a little unsure of what to parametrize here: in each for loop, I actually just copy the last test (the `actual = idx.get_indexer(...)` / `tm.assert_numpy_array_equal(...)` pair using a scalar tolerance) but change the tolerance to an array. I could add some comments to make it clearer?

Contributor:
parametrize on method,expected
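A sketch of the parametrization being suggested; the index, targets, and expected values below are illustrative stand-ins, not the PR's actual test data:

```python
import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize('method, expected', [
    ('pad', [0, -1, -1]),
    ('backfill', [-1, 1, -1]),
    ('nearest', [0, 1, -1]),
])
def test_get_indexer_listlike_tolerance(method, expected):
    # each target gets its own tolerance: 0.5, 0.2, 0.2
    idx = pd.Index([0.0, 2.0, 9.0])
    actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
                             tolerance=[0.5, 0.2, 0.2])
    np.testing.assert_array_equal(actual,
                                  np.array(expected, dtype=np.intp))
```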

@@ -783,6 +788,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):

if tolerance is not None:
tolerance = self._convert_tolerance(tolerance)
if target.size != tolerance.size and tolerance.size > 1:
Contributor:
I think I would also pass target to _convert_tolerance, so the error checks can move there.

Contributor:
You would change all the _convert_tolerance methods to pass target as well.

offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
nanos = tslib._delta_to_nanoseconds(other)
if isinstance(other, np.ndarray):
nanos = np.vectorize(tslib._delta_to_nanoseconds)(other)
Contributor:
The array version of this function is almost trivial; if you can, add it alongside the other and call it here (you just need to type the input as ndarray, I think).

Contributor:
Actually, ignore the above; this is fine here.

[[0, 2, -1], [0, -1, -1],
[-1, 2, 9]]):
# list
actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',
Contributor:
could also parametrize on the type of list-like
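Parametrizing over the container type could look like this sketch (the index and values are illustrative, not the PR's test data):

```python
import numpy as np
import pandas as pd
import pytest

# Each container type holds the same per-target tolerances.
@pytest.mark.parametrize('listlike', [list, tuple, np.array, pd.Series])
def test_get_indexer_tolerance_container(listlike):
    idx = pd.Index([0.0, 2.0, 9.0])
    tol = listlike([0.5, 0.2, 0.2])
    actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',
                             tolerance=tol)
    np.testing.assert_array_equal(actual, np.array([0, 1, -1],
                                                   dtype=np.intp))
```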

@jreback jreback added this to the 0.21.0 milestone Oct 14, 2017
@jreback (Contributor) commented Oct 14, 2017

@buntwo ok parametrized the tests, ping on green.

@jreback (Contributor) commented Oct 14, 2017

This is going to fail from some flake8 errors;

git diff master -u "*.py" | flake8 --diff

will show you them.

@jreback (Contributor) commented Oct 14, 2017

Also rebase on master (you don't need to squash, though it's easier to do so).

@buntwo buntwo force-pushed the ndarray_tolerance branch from 6535e14 to e190435 Compare October 14, 2017 16:08
@jreback (Contributor) commented Oct 14, 2017

Your rebasing is removing things. Just leave it and restore the code that was removed (in a new commit), making sure all tests pass.

@buntwo (Contributor, author) commented Oct 14, 2017

Apologies for the mistake in rebasing. I used git diff to cherry-pick the unintended changes and git apply -R to revert them.

@@ -83,8 +83,12 @@ def to_timedelta(arg, unit='ns', box=True, errors='raise'):
elif isinstance(arg, ABCIndexClass):
return _convert_listlike(arg, unit=unit, box=box,
errors=errors, name=arg.name)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
return _convert_listlike(arg, unit=unit, box=box, errors=errors)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
Contributor:
can you separate these into 2 conditions on the main if/else, IOW

elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 0:
     arg = arg.item()

# this would then be unchanged
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
     return _convert.....

@buntwo (Contributor, author) commented Oct 14, 2017

Ping

@jreback jreback merged commit 5c0b20a into pandas-dev:master Oct 14, 2017
@jreback (Contributor) commented Oct 14, 2017

thanks @buntwo

@buntwo (Contributor, author) commented Oct 14, 2017

thank you @jreback !

@buntwo buntwo deleted the ndarray_tolerance branch October 14, 2017 21:59
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
* upstream/master: (76 commits)
  CategoricalDtype construction: actually use fastpath (pandas-dev#17891)
  DEPR: Deprecate tupleize_cols in to_csv (pandas-dev#17877)
  BUG: Fix wrong column selection in drop_duplicates when duplicate column names (pandas-dev#17879)
  DOC: Adding examples to update docstring (pandas-dev#16812) (pandas-dev#17859)
  TST: Skip if no openpyxl in test_excel (pandas-dev#17883)
  TST: Catch read_html slow test warning (pandas-dev#17874)
  flake8 cleanup (pandas-dev#17873)
  TST: remove moar warnings (pandas-dev#17872)
  ENH: tolerance now takes list-like argument for reindex and get_indexer. (pandas-dev#17367)
  ERR: Raise ValueError when week is passed in to_datetime format witho… (pandas-dev#17819)
  TST: remove some deprecation warnings (pandas-dev#17870)
  Refactor index-as-string groupby tests and fix spurious warning (Bug 17383) (pandas-dev#17843)
  BUG: merging with a boolean/int categorical column (pandas-dev#17841)
  DEPR: Deprecate read_csv arguments fully (pandas-dev#17865)
  BUG: to_json - prevent various segfault conditions (GH14256) (pandas-dev#17857)
  CLN: Use pandas.core.common for None checks (pandas-dev#17816)
  BUG: set tz on DTI from fixed format HDFStore (pandas-dev#17844)
  RLS: v0.21.0rc1
  Whatsnew cleanup (pandas-dev#17858)
  DEPR: Deprecate the convert parameter completely (pandas-dev#17831)
  ...
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Labels: Enhancement, Indexing (Related to indexing on series/frames, not to indexes themselves)
6 participants