Commit 67730dd

tdhock authored and jreback committed
ENH: str.extractall for several matches
Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes #11386 from tdhock/extractall and squashes the following commits:

0c1c3d1 [Toby Dylan Hocking] ENH: extract(expand), extractall
1 parent 517c559 commit 67730dd

6 files changed

+913
-99
lines changed


doc/source/api.rst

+1
@@ -526,6 +526,7 @@ strings and apply several methods to it. These can be accessed like
526526
Series.str.encode
527527
Series.str.endswith
528528
Series.str.extract
529+
Series.str.extractall
529530
Series.str.find
530531
Series.str.findall
531532
Series.str.get

doc/source/text.rst

+127-17
@@ -168,28 +168,37 @@ Extracting Substrings
168168

169169
.. _text.extract:
170170

171-
The method ``extract`` (introduced in version 0.13) accepts `regular expressions
172-
<https://docs.python.org/2/library/re.html>`__ with match groups. Extracting a
173-
regular expression with one group returns a Series of strings.
171+
Extract first match in each subject (extract)
172+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
174173

175-
.. ipython:: python
174+
.. versionadded:: 0.13.0
175+
176+
.. warning::
177+
178+
In version 0.18.0, ``extract`` gained the ``expand`` argument. When
179+
``expand=False`` it returns a ``Series``, ``Index``, or
180+
``DataFrame``, depending on the subject and regular expression
181+
pattern (same behavior as pre-0.18.0). When ``expand=True`` it
182+
always returns a ``DataFrame``, which is more consistent and less
183+
confusing from the perspective of a user.
176184

177-
pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
185+
The ``extract`` method accepts a `regular expression
186+
<https://docs.python.org/2/library/re.html>`__ with at least one
187+
capture group.
178188

179-
Elements that do not match return ``NaN``. Extracting a regular expression
180-
with more than one group returns a DataFrame with one column per group.
189+
Extracting a regular expression with more than one group returns a
190+
DataFrame with one column per group.
181191

182192
.. ipython:: python
183193
184194
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
185195
186-
Elements that do not match return a row filled with ``NaN``.
187-
Thus, a Series of messy strings can be "converted" into a
188-
like-indexed Series or DataFrame of cleaned-up or more useful strings,
189-
without necessitating ``get()`` to access tuples or ``re.match`` objects.
190-
191-
The results dtype always is object, even if no match is found and the result
192-
only contains ``NaN``.
196+
Elements that do not match return a row filled with ``NaN``. Thus, a
197+
Series of messy strings can be "converted" into a like-indexed Series
198+
or DataFrame of cleaned-up or more useful strings, without
199+
necessitating ``get()`` to access tuples or ``re.match`` objects. The
200+
results dtype always is object, even if no match is found and the
201+
result only contains ``NaN``.
193202

194203
Named groups like
195204

@@ -201,9 +210,109 @@ and optional groups like
201210

202211
.. ipython:: python
203212
204-
pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
213+
pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')
214+
215+
can also be used. Note that any capture group names in the regular
216+
expression will be used for column names; otherwise capture group
217+
numbers will be used.
218+
219+
Extracting a regular expression with one group returns a ``DataFrame``
220+
with one column if ``expand=True``.
221+
222+
.. ipython:: python
223+
224+
pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
225+
226+
It returns a Series if ``expand=False``.
227+
228+
.. ipython:: python
229+
230+
pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
231+
232+
Calling on an ``Index`` with a regex with exactly one capture group
233+
returns a ``DataFrame`` with one column if ``expand=True``,
234+
235+
.. ipython:: python
236+
237+
s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
238+
s
239+
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
240+
241+
It returns an ``Index`` if ``expand=False``.
242+
243+
.. ipython:: python
244+
245+
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
246+
247+
Calling on an ``Index`` with a regex with more than one capture group
248+
returns a ``DataFrame`` if ``expand=True``.
249+
250+
.. ipython:: python
251+
252+
s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
253+
254+
It raises ``ValueError`` if ``expand=False``.
255+
256+
.. code-block:: python
257+
258+
>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
259+
ValueError: only one regex group is supported with Index
260+
261+
The table below summarizes the behavior of ``extract(expand=False)``
262+
(input subject in first column, number of groups in regex in
263+
first row)
264+
265+
+--------+---------+------------+
266+
|        | 1 group | >1 group   |
267+
+--------+---------+------------+
268+
| Index  | Index   | ValueError |
269+
+--------+---------+------------+
270+
| Series | Series  | DataFrame  |
271+
+--------+---------+------------+
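
The return types summarized in the table can be checked with a short script. This is a minimal sketch, not part of the commit: it assumes a pandas version that accepts the ``expand`` argument, and the helper ``result_kind`` and the variable names are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Illustrative subjects: a Series of strings and its string-valued Index.
series = pd.Series(["a1", "b2", "c3"], index=["A11", "B22", "C33"])
index = series.index

one_group = r"([a-zA-Z])"
two_groups = r"([a-zA-Z])([0-9]+)"


def result_kind(subject, pattern, expand):
    """Name the type returned by subject.str.extract, or 'ValueError' if it raises."""
    try:
        result = subject.str.extract(pattern, expand=expand)
    except ValueError:
        return "ValueError"
    if isinstance(result, pd.DataFrame):
        return "DataFrame"
    if isinstance(result, pd.Series):
        return "Series"
    if isinstance(result, pd.Index):
        return "Index"
    return type(result).__name__


# Keys are (subject kind, number of capture groups), mirroring the table layout.
cases = {
    ("Index", 1): (index, one_group),
    ("Index", 2): (index, two_groups),
    ("Series", 1): (series, one_group),
    ("Series", 2): (series, two_groups),
}
expand_false = {key: result_kind(s, p, False) for key, (s, p) in cases.items()}
expand_true = {key: result_kind(s, p, True) for key, (s, p) in cases.items()}

print(expand_false)
print(expand_true)
```

With ``expand=True`` every entry should be a ``DataFrame``, which is the consistency argument made in the warning at the top of the section.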
272+
273+
Extract all matches in each subject (extractall)
274+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
275+
276+
.. _text.extractall:
277+
278+
Unlike ``extract`` (which returns only the first match),
279+
280+
.. ipython:: python
281+
282+
s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
283+
s
284+
s.str.extract("[ab](?P<digit>\d)")
285+
286+
.. versionadded:: 0.18.0
287+
288+
the ``extractall`` method returns every match. The result of
289+
``extractall`` is always a ``DataFrame`` with a ``MultiIndex`` on its
290+
rows. The last level of the ``MultiIndex`` is named ``match`` and
291+
indicates the order in the subject.
292+
293+
.. ipython:: python
294+
295+
s.str.extractall("[ab](?P<digit>\d)")
296+
297+
When each subject string in the Series has exactly one match,
298+
299+
.. ipython:: python
300+
301+
s = pd.Series(['a3', 'b3', 'c2'])
302+
s
303+
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
304+
305+
then ``extractall(pat).xs(0, level='match')`` gives the same result as
306+
``extract(pat)``.
307+
308+
.. ipython:: python
309+
310+
extract_result = s.str.extract(two_groups)
311+
extract_result
312+
extractall_result = s.str.extractall(two_groups)
313+
extractall_result
314+
extractall_result.xs(0, level="match")
205315
206-
can also be used.
207316
208317
Testing for Strings that Match or Contain a Pattern
209318
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -288,7 +397,8 @@ Method Summary
288397
:meth:`~Series.str.endswith`,Equivalent to ``str.endswith(pat)`` for each element
289398
:meth:`~Series.str.findall`,Compute list of all occurrences of pattern/regex for each string
290399
:meth:`~Series.str.match`,"Call ``re.match`` on each element, returning matched groups as list"
291-
:meth:`~Series.str.extract`,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
400+
:meth:`~Series.str.extract`,"Call ``re.search`` on each element, returning DataFrame with one row for each element and one column for each regex capture group"
401+
:meth:`~Series.str.extractall`,"Call ``re.findall`` on each element, returning DataFrame with one row for each match and one column for each regex capture group"
292402
:meth:`~Series.str.len`,Compute string lengths
293403
:meth:`~Series.str.strip`,Equivalent to ``str.strip``
294404
:meth:`~Series.str.rstrip`,Equivalent to ``str.rstrip``
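
The summary rows for ``extract`` and ``extractall`` contrast first-match and every-match behavior. The sketch below illustrates that contrast with the standard ``re`` module only. It is not the pandas implementation: the dictionaries ``first_match`` and ``all_matches`` are hypothetical stand-ins for the returned objects, and ``re.finditer`` is used in place of ``re.findall`` so that the named group can be read from each match object.

```python
import re

pattern = re.compile(r"[ab](?P<digit>\d)")
subjects = {"A": "a1a2", "B": "b1", "C": "c1"}

first_match = {}  # one entry per subject, like the rows of extract
all_matches = {}  # one entry per (subject, match number), like extractall

for label, text in subjects.items():
    # re.search stops at the first match; None marks a subject with no match.
    found = pattern.search(text)
    first_match[label] = found.group("digit") if found else None

    # re.finditer yields every non-overlapping match in order, so the
    # enumerate counter plays the role of the "match" index level.
    for match_number, found in enumerate(pattern.finditer(text)):
        all_matches[(label, match_number)] = found.group("digit")

print(first_match)
print(all_matches)
```

Subject ``"C"`` appears in ``first_match`` with ``None`` but contributes no entry to ``all_matches``, which mirrors the difference between one row per element and one row per match.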

doc/source/whatsnew/v0.18.0.txt

+86
@@ -137,6 +137,92 @@ New Behavior:
137137
s.index
138138
s.index.nbytes
139139

140+
.. _whatsnew_0180.enhancements.extract:
141+
142+
Changes to str.extract
143+
^^^^^^^^^^^^^^^^^^^^^^
144+
145+
The :ref:`.str.extract <text.extract>` method takes a regular
146+
expression with capture groups, finds the first match in each subject
147+
string, and returns the contents of the capture groups
148+
(:issue:`11386`). In v0.18.0, the ``expand`` argument was added to
149+
``extract``. When ``expand=False`` it returns a ``Series``, ``Index``,
150+
or ``DataFrame``, depending on the subject and regular expression
151+
pattern (same behavior as pre-0.18.0). When ``expand=True`` it always
152+
returns a ``DataFrame``, which is more consistent and less confusing
153+
from the perspective of a user. Currently the default is
154+
``expand=None`` which gives a ``FutureWarning`` and uses
155+
``expand=False``. To avoid this warning, please explicitly specify
156+
``expand``.
157+
158+
.. ipython:: python
159+
160+
pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
161+
162+
Extracting a regular expression with one group returns a ``DataFrame``
163+
with one column if ``expand=True``.
164+
165+
.. ipython:: python
166+
167+
pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
168+
169+
It returns a Series if ``expand=False``.
170+
171+
.. ipython:: python
172+
173+
pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
174+
175+
Calling on an ``Index`` with a regex with exactly one capture group
176+
returns a ``DataFrame`` with one column if ``expand=True``,
177+
178+
.. ipython:: python
179+
180+
s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
181+
s
182+
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
183+
184+
It returns an ``Index`` if ``expand=False``.
185+
186+
.. ipython:: python
187+
188+
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
189+
190+
Calling on an ``Index`` with a regex with more than one capture group
191+
returns a ``DataFrame`` if ``expand=True``.
192+
193+
.. ipython:: python
194+
195+
s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
196+
197+
It raises ``ValueError`` if ``expand=False``.
198+
199+
.. code-block:: python
200+
201+
>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
202+
ValueError: only one regex group is supported with Index
203+
204+
In summary, ``extract(expand=True)`` always returns a ``DataFrame``
205+
with a row for every subject string, and a column for every capture
206+
group.
207+
208+
.. _whatsnew_0180.enhancements.extractall:
209+
210+
The :ref:`.str.extractall <text.extractall>` method was added
211+
(:issue:`11386`). Unlike ``extract`` (which returns only the first
212+
match),
213+
214+
.. ipython:: python
215+
216+
s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
217+
s
218+
s.str.extract("(?P<letter>[ab])(?P<digit>\d)")
219+
220+
the ``extractall`` method returns all matches.
221+
222+
.. ipython:: python
223+
224+
s.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
225+
140226
.. _whatsnew_0180.enhancements.rounding:
141227

142228
Datetimelike rounding
