-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redefine match #5224
Redefine match #5224
Conversation
Today I realized there's a |
Ha! Well I'm glad this didn't take too long. Hmm. |
I just found it today - I had no idea. I guess we should just have match be a synonym with warning for extract and then deprecate? Or just let match keep doing its thing and emit warnings til we remove it? |
You might see if your implementation is better than That said, if you want to fix things like this in the future, easiest thing to do is diff your final commit against master, save the diff just on the files you care about, checkout master, then use git apply with the diff and then commit it again. That or copy/paste :) |
(I mean fix git issues.) |
@danielballan this needs to be rebased to eliminate the first commit |
@danielballan yes...I thought that was fine for 0.13....I haven't looked at this close...is that the intention here? |
No, this made match return a boolean indexer, which we now realize is already covered by str.contains. Cannot fix today, but probably tomorrow. |
I'm still mixed on the naming here. What if we make match use re.match whereas contains can use re.search? I'm still not sold on adding extract as a name and deprecating match. Though it might be surprising to have match return different ndim objects depending on number of groups, there's something nice about having one function work intelligently based on number of match groups. |
how's this coming along? |
We need a final decision on the naming. I have come around to preferring @jtratner's suggestion, where str.match and str.contains return boolean indexers based on re.match and re.search respectively. That leaves str.extract intact, effectively doing what str.match used to do, but in a more useful structure and (arguably) with more descriptive name. Examples from current PR branch "redefine-match":
If all this looks good, all we need are docs. |
Your suggested api looks good, but I'd prefer extract use re.search (if I'd prefer the match warning to say 'In future versions of pandas, match Similarly, if as_indexer is True and the regex has match groups (or using |
Changes made. If we all like them, I'll write docs tomorrow. Let me know. |
fyi....seems you have an odd commit in there....can you rebase off of current master? |
can you rebase against master? |
Sorry, nonstop work at my day job, pushing a paper out. Will do today when I get to my work machine. |
Fixed. Thanks for the git guidance (awhile ago) @jtratner. Ready for docs? Are we happy with this? |
I like this. So to summarize we are effectively deprecating |
0.12: 0.13: 0.14: |
I would say go ahead with the docs (put that in a separate commit).... for v0.13.0 and in the string section |
@danielballan I would possibly make some sub-sections under the string methods, e.g. maybe for the methods that you are actually explaining (e.g. match/extract/split.....). |
I'm very happy with this (and I appreciate your flexibility!)- can you confirm one thing for me:
So this means when match gets something with zero match groups, it returns a boolean indexer? That way for the majority of use cases match will just work. (i.e., no need to even put examples in the docs with match groups) and then we'd warn when there were match groups but preserve existing. Thanks for putting this together! |
I hadn't thought of that. Currently, match only returns a boolean indexer is you set Sorry for the delay on my end. Not losing patience, just short on free time this week. |
@danielballan how's this coming? |
Starting the docs, I came across this: Methods like The new version of match should do that as well. The only different between contains and the new match should be re.search vs. re.match. |
that's fine, fine with me if you incorporate that. |
Just ping me when you think this is all ready to go. Also, if you want to |
I'd really like this to be in 0.13 rc, because it will wrap up str methods |
Ready. Now the APIs and underlying codes for |
As long as it passes the tests, I'm fine with this. I probably would want to edit the docs somewhat to just show what you can do in 0.13 and not bother mentioning all the weird deprecated things you can do (esp since I think you can always go back and look at docs for older versions), but I'm fine with doing it later and getting this functionality merged for now :) |
I waffled on how much to explain, and probably erred saying too much. I'll heading back over to see if I can fix filter in time for the RC. We can come back to this. |
Yeah, exactly, docs are easy to pare down later - just commenting that we should do it at some point. Frankly, the entire docs section on string matching could be tightened up a bit. |
@@ -411,10 +419,52 @@ def test_match(self): | |||
# unicode | |||
values = Series([u('fooBAD__barBAD'), NA, u('foo')]) | |||
|
|||
result = values.str.match('.*(BAD[_]+).*(BAD)') | |||
with warnings.catch_warnings(record=True) as w: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
future reference - you can use assert_produces_warning()
here as well, so you could get down to:
with tm.assert_produces_warning():
result = values.str.match('.*(BAD[_]+).*(BAD)')
Thanks! |
One workaround/trick is to do (where regex is
|
… match (GH5224) This PR changes the default behaviour of `str.match` from extracting groups to just a match (True/False). The previous default behaviour was deprecated since 0.13.0 (#5224) Author: Joris Van den Bossche <jorisvandenbossche@gmail.com> Closes #15257 from jorisvandenbossche/str-match and squashes the following commits: 0ab36b6 [Joris Van den Bossche] Raise FutureWarning instead of UserWarning for as_indexer a2bae51 [Joris Van den Bossche] raise error in case of regex with groups and as_indexer=False 87446c3 [Joris Van den Bossche] fix test 0788de2 [Joris Van den Bossche] API: change default behaviour of str.match from deprecated extract to match (GH5224)
… match (GH5224) This PR changes the default behaviour of `str.match` from extracting groups to just a match (True/False). The previous default behaviour was deprecated since 0.13.0 (pandas-dev#5224) Author: Joris Van den Bossche <jorisvandenbossche@gmail.com> Closes pandas-dev#15257 from jorisvandenbossche/str-match and squashes the following commits: 0ab36b6 [Joris Van den Bossche] Raise FutureWarning instead of UserWarning for as_indexer a2bae51 [Joris Van den Bossche] raise error in case of regex with groups and as_indexer=False 87446c3 [Joris Van den Bossche] fix test 0788de2 [Joris Van den Bossche] API: change default behaviour of str.match from deprecated extract to match (GH5224)
closes #5075
How does this look? If the docstring and the tests reflect our consensus, I'll take a stab at the docs. This is the gist of it:
Default behavior is unchanged, but issues a warning.
New, more useful behavior is available through
as_indexer
.P.S. There's a stray commit in here. Not sure why 88716e4 got lumped in....