Documentation/Evolution/StronglyTypedCaptures.md
+49-33
Original file line number
Diff line number
Diff line change
@@ -37,10 +37,10 @@ if let match = "abcddddefgh".firstMatch(of: regex) {
37
37
38
38
>_**Note:** The `Regex` type includes, and `firstMatch(of:)` returns, the entire match as the "0th element"._
39
39
40
-
We introduce a generic type `Regex<Match>`, which treats the type of captures as part of a regular expression's type information for clarity, type safety, and convenience. As we explore a fundamental design aspect of the regular expression feature, this pitch discusses the following topics:
40
+
We introduce a generic type `Regex<Match>`, which treats the capture types as part of a regular expression's type information for clarity, type safety, and convenience. As we explore a fundamental design aspect of the regular expression feature, this pitch discusses the following topics:
41
41
42
42
- A type definition of the generic type `Regex<Match>` and `firstMatch(of:)` method.
43
-
- Capture type inference and composition in regular expression literals and the forthcoming result builder syntax.
43
+
- Inference and composition of capture types in regular expression literals and the forthcoming result builder syntax.
44
44
- New language features which this design may require.
45
45
46
46
The focus of this pitch is the structural properties of capture types and how regular expression patterns compose to form new capture types. The semantics of string matching, its effect on the capture types (i.e. `UnicodeScalarView.SubSequence` or `Substring`), and the result builder syntax will be discussed in future pitches.
@@ -52,7 +52,7 @@ For background on Declarative String Processing, see related topics:
52
52
53
53
## Motivation
54
54
55
-
Across a variety of programming languages, many established regular expression libraries present captures as a collection of captured content to the caller upon a successful match [[1](https://developer.apple.com/documentation/foundation/nsregularexpression)][[2](https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.capture)]. However, to know the structure of captured contents, programmers often need to carefully read the regular expression or run the regular expression on some input to find out. Because regular expressions are oftentimes statically available in the source code, there is a missed opportunity to use generics to present captures as part of type information to the programmer, and to leverage the compiler to infer the type of captures based on a regular expression literal. As we propose to introduce declarative string processing capabilities to the language and the Standard Library, we would like to explore a type-safe approach to regular expression captures.
55
+
Across a variety of programming languages, many established regular expression libraries present a collection of captured content to the caller upon a successful match [[1](https://developer.apple.com/documentation/foundation/nsregularexpression)][[2](https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.capture)]. However, to know the structure of captured contents, programmers often need to carefully read the regular expression or run the regular expression on some input to find out. Because regular expressions are oftentimes statically available in the source code, there is a missed opportunity to use generics to present captures as part of type information to the programmer, and to leverage the compiler to infer the type of captures based on a regular expression literal. As we propose to introduce declarative string processing capabilities to the language and the Standard Library, we would like to explore a type-safe approach to regular expression captures.
56
56
57
57
## Proposed solution
58
58
@@ -77,6 +77,15 @@ if let match = "abcddddefgh".firstMatch(of: regex) {
77
77
}
78
78
```
79
79
80
+
Quantifiers (`*`, `+`, and `?`) and alternations (`|`) wrap each capture inside them in `Array` or `Optional`. These structures can be nested, so a capture which is inside multiple levels of quantifiers or alternations will end up with a type like `[Substring?]?`. To ensure that backreference numbering and tuple element numbering match, each capture is separately wrapped in the structure implied by the quantifiers and alternations around it, rather than wrapping tuples of adjacent captures in the structure.
81
+
82
+
```swift
83
+
let regex = /ab(?:c(d)*(ef))?gh/
84
+
if let match = "abcddddefgh".firstMatch(of: regex) {
@@ -114,7 +123,7 @@ if let match = line.firstMatch(of: scalarRangePattern) {
114
123
115
124
> ***Note**: Additional features like efficient access to the matched ranges are out-of-scope for this pitch, but will likely mean returning a nominal type from `firstMatch(of:)`. In this pitch, the result type of `firstMatch(of:)` is a tuple of `Substring`s for simplicity and brevity. Either way, the developer experience is meant to be light-weight and tuple-y. Any nominal type would likely come with dynamic member lookup for accessing captures by index (i.e. `.0`, `.1`, etc.) and name.*
116
125
117
-
### Capture type
126
+
### Capture types
118
127
119
128
In this section, we describe the inferred capture types for regular expression patterns and how they compose.
120
129
@@ -126,11 +135,7 @@ By default, a regular expression literal has type `Regex`. Its generic argument
126
135
Capture types
127
136
```
128
137
129
-
When there are no captures, `Match` is just the entire matched substring.
130
-
131
-
#### Basics
132
-
133
-
Regular expressions without any capturing groups have type `Regex<Substring>`, for example:
138
+
When there are no captures, `Match` is just the entire matched substring, for example:
134
139
135
140
```swift
136
141
let identifier = /[_a-zA-Z]+[_a-zA-Z0-9]*/  // => `Regex<Substring>`
This falls out of Swift's normal type system rules, which treat a 1-tuple as synonymous with the element itself.
151
+
145
152
#### Capturing group: `(...)`
146
153
147
-
A capturing group saves the portion of the input matched by its contained pattern. Its capture type is `Substring`.
154
+
A capturing group saves the portion of the input matched by its contained pattern. The capture type of a leaf capturing group is `Substring`.
148
155
149
156
```swift
150
157
let graphemeBreakLowerBound = /([0-9a-fA-F]+)/
@@ -156,7 +163,7 @@ let graphemeBreakLowerBound = /([0-9a-fA-F]+)/
156
163
157
164
#### Concatenation: `abc`
158
165
159
-
A concatenation's `Match` is a tuple of `Substring`s followed by every pattern's capture type. When there are no capturing groups, the `Match` is just `Substring`.
166
+
A concatenation's capture types are a concatenation of the capture types of its underlying patterns, ignoring any underlying patterns with no captures.
160
167
161
168
```swift
162
169
let graphemeBreakLowerBound = /([0-9a-fA-F]+)\.\.[0-9a-fA-F]+/
@@ -182,7 +189,7 @@ let graphemeBreakRange = /([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)/
182
189
183
190
#### Named capturing group: `(?<name>...)`
184
191
185
-
A named capturing group's capture type is `Substring`. In its `Match` type, the capture type has a tuple element label specified by the capture name.
192
+
A named capturing group includes the capture's name as the label of the tuple element.
186
193
187
194
```swift
188
195
let graphemeBreakLowerBound = /(?<lower>[0-9a-fA-F]+)\.\.[0-9a-fA-F]+/
@@ -194,7 +201,7 @@ let graphemeBreakRange = /(?<lower>[0-9a-fA-F]+)\.\.(?<upper>[0-9a-fA-F]+)/
194
201
195
202
#### Non-capturing group: `(?:...)`
196
203
197
-
A non-capturing group's capture type is the same as its underlying pattern's. That is, it does not capture anything by itself, but transparently propagates its underlying pattern's captures.
204
+
A non-capturing group's capture types are the same as its underlying pattern's. That is, it does not capture anything by itself, but transparently propagates its underlying pattern's captures.
198
205
199
206
```swift
200
207
let graphemeBreakLowerBound = /([0-9a-fA-F]+)(?:\.\.([0-9a-fA-F]+))?/
@@ -212,7 +219,7 @@ let graphemeBreakLowerBound = /([0-9a-fA-F]+)(?:\.\.([0-9a-fA-F]+))?/
212
219
213
220
#### Nested capturing group: `(...(...))`
214
221
215
-
When capturing group is nested within another capturing group, they count as two distinct captures in the order their left parenthesis first appears in the regular expression literal. This is consistent with traditional regex backreference numbering.
222
+
When a capturing group is nested within another capturing group, they count as two distinct captures in the order their left parenthesis first appears in the regular expression literal. This is consistent with traditional regex backreference numbering.
216
223
217
224
```swift
218
225
let graphemeBreakPropertyData = /(([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+)))\s*;\s(\w+).*/
@@ -246,16 +253,16 @@ let input = "007F..009F ; Control"
A quantifier wraps its underlying pattern's capture type in either an `Optional` or `Array`. Zero-or-one quantification (`?`) produces an `Optional` and all others produce an `Array`. The kind of quantification, i.e. greedy vs reluctant vs possessive, is irrelevant to determining the capture type.
256
+
A quantifier wraps its underlying pattern's capture types in either `Optional`s or `Array`s. Zero-or-one quantification (`?`) produces an `Optional` and all others produce an `Array`. The kind of quantification, i.e. greedy vs reluctant vs possessive, is irrelevant to determining the capture type.
|`*`| 0 or more |`Array` of the sub-pattern capture type |
254
-
|`+`| 1 or more |`Array` of the sub-pattern capture type |
255
-
|`?`| 0 or 1 |`Optional` of the sub-pattern capture type |
256
-
|`{n}`| Exactly _n_|`Array` of the sub-pattern capture type |
257
-
|`{n,m}`| Between _n_ and _m_|`Array` of the sub-pattern capture type |
258
-
|`{n,}`|_n_ or more |`Array` of the sub-pattern capture type |
260
+
|`*`| 0 or more |`Array`s of the sub-pattern capture types|
261
+
|`+`| 1 or more |`Array`s of the sub-pattern capture types|
262
+
|`?`| 0 or 1 |`Optional`s of the sub-pattern capture types|
263
+
|`{n}`| Exactly _n_|`Array`s of the sub-pattern capture types|
264
+
|`{n,m}`| Between _n_ and _m_|`Array`s of the sub-pattern capture types|
265
+
|`{n,}`|_n_ or more |`Array`s of the sub-pattern capture types|
259
266
260
267
```swift
261
268
/([0-9a-fA-F]+)+/
@@ -360,12 +367,21 @@ if let match = "1234-5678-9abc-def0".firstMatch(of: pattern) {
360
367
// Prints ["1234", "5678", "9abc", "def0"]
361
368
```
362
369
363
-
We believe that the proposed capture behavior leads to better consistency with the meaning of these quantifiers. However, the alternative behavior does have the advantage of a smaller memory footprint because the matching algorithm would not need to allocate storage for capturing anything but the last match. As a future direction, we could introduce some way of opting into this behavior.
370
+
We believe that the proposed capture behavior is more intuitive. However, the alternative behavior has a smaller memory footprint and is more consistent with usage of backreferences, which only refer to the last match of the repeated capture group:
371
+
372
+
```swift
373
+
let pattern = /(?:([0-9a-fA-F]+)-?)+ \1/
374
+
var match = "1234-5678-9abc-def0 def0".firstMatch(of: pattern)
375
+
print(match != nil) // true
376
+
match = "1234-5678-9abc-def0 1234".firstMatch(of: pattern)
377
+
print(match != nil) // false
378
+
```
379
+
380
+
As a future direction, we could introduce some way of opting into this behavior.
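The backreference behavior described above can be checked with a minimal sketch. It assumes the `Regex` API that later shipped in Swift 5.7 (`Regex.init(_:)` from a pattern string, and `firstMatch(of:)` returning an optional match), which is not the API shape proposed in this pitch. The names `repeatsLast` and `repeatsFirst` are illustrative only, and the pattern is written with an explicit space before `\1` because the sample inputs separate the trailing group with a space.

```swift
// Sketch only: uses the shipped Swift 5.7 Regex API, not the pitch's tuple-returning API.
// The pattern is built from a string so that no bare-slash literal flag is needed.
let backreferencePattern = try! Regex(#"(?:([0-9a-fA-F]+)-?)+ \1"#)

// `\1` refers to the last iteration of the repeated capture ("def0"),
// so only an input that repeats the final group can match.
let repeatsLast = "1234-5678-9abc-def0 def0".firstMatch(of: backreferencePattern) != nil
let repeatsFirst = "1234-5678-9abc-def0 1234".firstMatch(of: backreferencePattern) != nil

print(repeatsLast, repeatsFirst)
```

If the backreference instead referred to every iteration, or to the first one, the second input would be expected to match as well.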
364
381
365
382
#### Alternation: `a|b`
366
383
367
-
Alternations are used to match one of multiple patterns. An alternation wraps
368
-
its underlying pattern's capture type in an `Optional`.
384
+
Alternations are used to match one of multiple patterns. An alternation wraps its underlying patterns' capture types in `Optional`s and concatenates them together, first to last.
369
385
370
386
```swift
371
387
let numberAlternationRegex = /([01]+)|[0-9]+|([0-9a-fA-F]+)/
@@ -512,27 +528,27 @@ For example, to be consistent with traditional regex backreferences quantificati
The structured capture type is safer because the type system encodes that there are an equal number of `lower` and `upper` hex numbers. It's also more convenient because you're likely to be processing `lower` and `upper` in parallel (e.g. to create ranges).
538
+
The structured capture types are safer because the type system encodes that there are an equal number of `lower` and `upper` hex numbers. It's also more convenient because you're likely to be processing `lower` and `upper` in parallel (e.g. to create ranges).
523
539
524
540
Similarly, alternations of multiple or nested captures produce flat optionals rather than a structured alternation type.
The structured capture type is safer because the type system encodes which options in the alternation of mutually exclusive. It'd also be much more convenient if, in the future, `Alternation` could behave like an enum, allowing exhaustive switching over all the options.
551
+
The structured capture types are safer because the type system encodes which options in the alternation are mutually exclusive. It'd also be much more convenient if, in the future, `Alternation` could behave like an enum, allowing exhaustive switching over all the options.
536
552
537
553
It's possible to derive the flat type from the structured type (but not vice versa), so `Regex` could be generic over the structured type and `firstMatch(of:)` could return a result type that vends both.
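As a rough illustration of why the derivation only goes one way, the sketch below models a two-branch alternation. The `Alternation` enum and its `flat` property are hypothetical and are not part of the pitch's API; they only show that a structured value always has a flat counterpart, while flat values such as `(nil, nil)` or two non-`nil` elements have no structured counterpart.

```swift
// Hypothetical structured capture type for a two-branch alternation.
// Exactly one branch can be populated, which the enum encodes directly.
enum Alternation<First, Second> {
    case first(First)
    case second(Second)

    // Derive the flat shape: one Optional per capture, in left-to-right order.
    var flat: (First?, Second?) {
        switch self {
        case .first(let value):
            return (value, nil)
        case .second(let value):
            return (nil, value)
        }
    }
}

// A match in which only the second branch captured something.
let structured: Alternation<Substring, Substring> = .second("7f")
let flat = structured.flat

print(flat)
```

Going from `(First?, Second?)` back to `Alternation` would need a failure case for the combinations the enum cannot represent, which is the sense in which the flat type carries less information.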
538
554
@@ -546,9 +562,9 @@ extension String {
546
562
}
547
563
```
548
564
549
-
This is cool, but it adds extra complexity to `Regex` and it isn't as clear because the generic type no longer aligns with the traditional regex backreference numbering. Because the primary motivation for providing regex literals in Swift is their familiarity, we think the consistency of the flat capture type trumps the added safety and ergonomics of the structured captures type.
565
+
This is cool, but it adds extra complexity to `Regex` and it isn't as clear because the generic type no longer aligns with the traditional regex backreference numbering. Because the primary motivation for providing regex literals in Swift is their familiarity, we think the consistency of the flat capture types trumps the added safety and ergonomics of the structured capture types.
550
566
551
-
We think the calculus probably flips in favor of a structured capture type for the result builder syntax, for which familiarity is not as high a priority.
567
+
We think the calculus probably flips in favor of structured capture types for the result builder syntax, for which familiarity is not as high a priority.
552
568
553
569
## Future directions
554
570
@@ -566,7 +582,7 @@ public struct DynamicCaptures: Equatable, RandomAccessCollection {
566
582
subscript(position: Int) -> DynamicCaptures { get }