@@ -168,28 +168,37 @@ Extracting Substrings
  .. _text.extract:

- The method ``extract`` (introduced in version 0.13) accepts `regular expressions
- <https://docs.python.org/2/library/re.html>`__ with match groups. Extracting a
- regular expression with one group returns a Series of strings.
+ Extract first match in each subject (extract)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- .. ipython:: python
+ .. versionadded:: 0.13.0
+
+ .. warning::
+
+    In version 0.18.0, ``extract`` gained the ``expand`` argument. When
+    ``expand=False`` it returns a ``Series``, ``Index``, or
+    ``DataFrame``, depending on the subject and regular expression
+    pattern (same behavior as pre-0.18.0). When ``expand=True`` it
+    always returns a ``DataFrame``, which is more consistent and less
+    confusing from the perspective of a user.

-    pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
+ The ``extract`` method accepts a `regular expression
+ <https://docs.python.org/2/library/re.html>`__ with at least one
+ capture group.

- Elements that do not match return ``NaN``. Extracting a regular expression
- with more than one group returns a DataFrame with one column per group.
+ Extracting a regular expression with more than one group returns a
+ DataFrame with one column per group.

  .. ipython:: python

     pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')

- Elements that do not match return a row filled with ``NaN``.
- Thus, a Series of messy strings can be "converted" into a
- like-indexed Series or DataFrame of cleaned-up or more useful strings,
- without necessitating ``get()`` to access tuples or ``re.match`` objects.
-
- The results dtype always is object, even if no match is found and the result
- only contains ``NaN``.
+ Elements that do not match return a row filled with ``NaN``. Thus, a
+ Series of messy strings can be "converted" into a like-indexed Series
+ or DataFrame of cleaned-up or more useful strings, without
+ necessitating ``get()`` to access tuples or ``re.match`` objects. The
+ results dtype always is object, even if no match is found and the
+ result only contains ``NaN``.

  Named groups like

@@ -201,9 +210,109 @@ and optional groups like
  .. ipython:: python

-    pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
+    pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')
+
+ can also be used. Note that any capture group names in the regular
+ expression will be used for column names; otherwise capture group
+ numbers will be used.
+
+ Extracting a regular expression with one group returns a ``DataFrame``
+ with one column if ``expand=True``.
+
+ .. ipython:: python
+
+    pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
+
+ It returns a Series if ``expand=False``.
+
+ .. ipython:: python
+
+    pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
+
+ Calling on an ``Index`` with a regex with exactly one capture group
+ returns a ``DataFrame`` with one column if ``expand=True``.
+
+ .. ipython:: python
+
+    s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
+    s
+    s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
+
+ It returns an ``Index`` if ``expand=False``.
+
+ .. ipython:: python
+
+    s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
+
+ Calling on an ``Index`` with a regex with more than one capture group
+ returns a ``DataFrame`` if ``expand=True``.
+
+ .. ipython:: python
+
+    s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
+
+ It raises ``ValueError`` if ``expand=False``.
+
+ .. code-block:: python
+
+     >>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
+     ValueError: only one regex group is supported with Index
+
+ The table below summarizes the behavior of ``extract(expand=False)``
+ (input subject in first column, number of groups in regex in
+ first row)
+
+ +--------+---------+------------+
+ |        | 1 group | >1 group   |
+ +--------+---------+------------+
+ | Index  | Index   | ValueError |
+ +--------+---------+------------+
+ | Series | Series  | DataFrame  |
+ +--------+---------+------------+
+
+ Extract all matches in each subject (extractall)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ .. _text.extractall:
+
+ Unlike ``extract`` (which returns only the first match),
+
+ .. ipython:: python
+
+    s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
+    s
+    s.str.extract("[ab](?P<digit>\d)")
+
+ .. versionadded:: 0.18.0
+
+ the ``extractall`` method returns every match. The result of
+ ``extractall`` is always a ``DataFrame`` with a ``MultiIndex`` on its
+ rows. The last level of the ``MultiIndex`` is named ``match`` and
+ indicates the order in the subject.
+
+ .. ipython:: python
+
+    s.str.extractall("[ab](?P<digit>\d)")
+
+ When each subject string in the Series has exactly one match,
+
+ .. ipython:: python
+
+    s = pd.Series(['a3', 'b3', 'c2'])
+    s
+    two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
+
+ then ``extractall(pat).xs(0, level='match')`` gives the same result as
+ ``extract(pat)``.
+
+ .. ipython:: python
+
+    extract_result = s.str.extract(two_groups)
+    extract_result
+    extractall_result = s.str.extractall(two_groups)
+    extractall_result
+    extractall_result.xs(0, level="match")

- can also be used.

  Testing for Strings that Match or Contain a Pattern
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -288,7 +397,8 @@ Method Summary
  :meth:`~Series.str.endswith`,Equivalent to ``str.endswith(pat)`` for each element
  :meth:`~Series.str.findall`,Compute list of all occurrences of pattern/regex for each string
  :meth:`~Series.str.match`,"Call ``re.match`` on each element, returning matched groups as list"
- :meth:`~Series.str.extract`,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
+ :meth:`~Series.str.extract`,"Call ``re.search`` on each element, returning DataFrame with one row for each element and one column for each regex capture group"
+ :meth:`~Series.str.extractall`,"Call ``re.findall`` on each element, returning DataFrame with one row for each match and one column for each regex capture group"
  :meth:`~Series.str.len`,Compute string lengths
  :meth:`~Series.str.strip`,Equivalent to ``str.strip``
  :meth:`~Series.str.rstrip`,Equivalent to ``str.rstrip``
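
Note on the Method Summary rows above: they describe ``extract`` as a per-element ``re.search`` and ``extractall`` as a per-element ``re.findall``. The sketch below illustrates that first-match versus every-match distinction using only the standard library. It is not how pandas implements either method: the helper names ``first_match`` and ``all_matches`` are hypothetical, plain dicts stand in for a Series, and ``re.finditer`` is used in place of ``re.findall`` so that each match is available as a match object.

```python
import re


# Hypothetical helpers for illustration only; they are not part of pandas.
def first_match(pattern, subject):
    """Return the capture groups of the first match, or None if nothing matches."""
    m = re.search(pattern, subject)
    return m.groups() if m else None


def all_matches(pattern, subject):
    """Return the capture groups of every non-overlapping match, in order."""
    return [m.groups() for m in re.finditer(pattern, subject)]


# Same subjects and pattern as the extractall example in the diff.
subjects = {"A": "a1a2", "B": "b1", "C": "c1"}
pattern = r"[ab](?P<digit>\d)"

first = {key: first_match(pattern, text) for key, text in subjects.items()}
every = {key: all_matches(pattern, text) for key, text in subjects.items()}

print(first)
print(every)
```

Subject ``"A"`` contains two matches, so only the every-match variant reports the second digit; subject ``"C"`` has no match, which corresponds to the ``NaN`` row in ``extract`` and to no rows at all in ``extractall``.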