Commit 83fedac

Incorporate @beccadax's feedback into StronglyTypedCaptures.md

Author: Kyle Macomber
1 parent 4326e1e

1 file changed

+49
-33
lines changed

Documentation/Evolution/StronglyTypedCaptures.md

@@ -37,10 +37,10 @@ if let match = "abcddddefgh".firstMatch(of: regex) {
3737

3838
>_**Note:** The `Regex` type includes, and `firstMatch(of:)` returns, the entire match as the "0th element"._
3939
40-
We introduce a generic type `Regex<Match>`, which treats the type of captures as part of a regular expression's type information for clarity, type safety, and convenience. As we explore a fundamental design aspect of the regular expression feature, this pitch discusses the following topics:
40+
We introduce a generic type `Regex<Match>`, which treats the capture types as part of a regular expression's type information for clarity, type safety, and convenience. As we explore a fundamental design aspect of the regular expression feature, this pitch discusses the following topics:
4141

4242
- A type definition of the generic type `Regex<Match>` and `firstMatch(of:)` method.
43-
- Capture type inference and composition in regular expression literals and the forthcoming result builder syntax.
43+
- Inference and composition of capture types in regular expression literals and the forthcoming result builder syntax.
4444
- New language features which this design may require.
4545

4646
The focus of this pitch is the structural properties of capture types and how regular expression patterns compose to form new capture types. The semantics of string matching, its effect on the capture types (i.e. `UnicodeScalarView.SubSequence` or `Substring`), and the result builder syntax will be discussed in future pitches.
@@ -52,7 +52,7 @@ For background on Declarative String Processing, see related topics:
5252

5353
## Motivation
5454

55-
Across a variety of programming languages, many established regular expression libraries present captures as a collection of captured content to the caller upon a successful match [[1](https://developer.apple.com/documentation/foundation/nsregularexpression)][[2](https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.capture)]. However, to know the structure of captured contents, programmers often need to carefully read the regular expression or run the regular expression on some input to find out. Because regular expressions are oftentimes statically available in the source code, there is a missed opportunity to use generics to present captures as part of type information to the programmer, and to leverage the compiler to infer the type of captures based on a regular expression literal. As we propose to introduce declarative string processing capabilities to the language and the Standard Library, we would like to explore a type-safe approach to regular expression captures.
55+
Across a variety of programming languages, many established regular expression libraries present a collection of captured content to the caller upon a successful match [[1](https://developer.apple.com/documentation/foundation/nsregularexpression)][[2](https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.capture)]. However, to know the structure of captured contents, programmers often need to carefully read the regular expression or run the regular expression on some input to find out. Because regular expressions are oftentimes statically available in the source code, there is a missed opportunity to use generics to present captures as part of type information to the programmer, and to leverage the compiler to infer the type of captures based on a regular expression literal. As we propose to introduce declarative string processing capabilities to the language and the Standard Library, we would like to explore a type-safe approach to regular expression captures.
5656

5757
## Proposed solution
5858

@@ -77,6 +77,15 @@ if let match = "abcddddefgh".firstMatch(of: regex) {
7777
}
7878
```
7979

80+
Quantifiers (`*`, `+`, and `?`) and alternations (`|`) wrap each capture inside them in `Array` or `Optional`. These structures can be nested, so a capture which is inside multiple levels of quantifiers or alternations will end up with a type like `[Substring?]?`. To ensure that backreference numbering and tuple element numbering match, each capture is separately wrapped in the structure implied by the quantifiers and alternations around it, rather than wrapping tuples of adjacent captures in the structure.
81+
82+
```swift
83+
let regex = /ab(?:c(d)*(ef))?gh/
84+
if let match = "abcddddefgh".firstMatch(of: regex) {
85+
print((match.1, match.2)) // => (Optional(["d","d","d","d"]), Optional("ef"))
86+
}
87+
```
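
The wrapping rule in the added paragraph can be modeled without the pitched `Regex` type at all. The following is only a sketch: `Pattern`, `Capture`, `captures(of:)`, and `matchType(of:)` are hypothetical names invented for illustration, capture types are represented as plain strings, and the outer group of a nested capture is assumed to contribute `Substring`.

```swift
// Hypothetical model of the capture-type rules; not part of the pitch's API.
indirect enum Pattern {
    case literal                      // matches text, captures nothing
    case capture(String?, Pattern)    // (...) or (?<name>...)
    case nonCapturing(Pattern)        // (?:...)
    case concatenation([Pattern])     // abc
    case zeroOrOne(Pattern)           // ?
    case repeated(Pattern)            // *, +, {n}, {n,}, {n,m}
    case alternation([Pattern])       // a|b
}

struct Capture {
    var name: String?
    var type: String
}

// Lists captures in backreference order (by opening parenthesis), wrapping
// each one individually in the structure implied by what surrounds it.
func captures(of pattern: Pattern) -> [Capture] {
    switch pattern {
    case .literal:
        return []
    case let .capture(name, inner):
        // Assumption: the group itself is numbered before any groups nested in it.
        return [Capture(name: name, type: "Substring")] + captures(of: inner)
    case let .nonCapturing(inner):
        return captures(of: inner)
    case let .concatenation(parts):
        return parts.flatMap { captures(of: $0) }
    case let .zeroOrOne(inner):
        return captures(of: inner).map { Capture(name: $0.name, type: $0.type + "?") }
    case let .repeated(inner):
        return captures(of: inner).map { Capture(name: $0.name, type: "[" + $0.type + "]") }
    case let .alternation(options):
        return options
            .flatMap { captures(of: $0) }
            .map { Capture(name: $0.name, type: $0.type + "?") }
    }
}

// Spells the `Match` type: the whole match first, then every capture.
func matchType(of pattern: Pattern) -> String {
    let elements = captures(of: pattern).map { capture -> String in
        if let name = capture.name { return "\(name): \(capture.type)" }
        return capture.type
    }
    if elements.isEmpty { return "Substring" }
    return "(" + (["Substring"] + elements).joined(separator: ", ") + ")"
}

// Models /ab(?:c(d)*(ef))?gh/
let example = Pattern.concatenation([
    .literal,                                  // ab
    .zeroOrOne(.nonCapturing(.concatenation([
        .literal,                              // c
        .repeated(.capture(nil, .literal)),    // (d)*
        .capture(nil, .literal),               // (ef)
    ]))),
    .literal,                                  // gh
])
print(matchType(of: example))  // (Substring, [Substring]?, Substring?)
```

Because each capture is wrapped on its own, the tuple position of `(d)` and `(ef)` stays aligned with backreferences `\1` and `\2`, which would not hold if the two captures were first grouped into a tuple and then wrapped.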
88+
8089
## Detailed design
8190

8291
### `Regex` type
@@ -114,7 +123,7 @@ if let match = line.firstMatch(of: scalarRangePattern) {
114123

115124
> ***Note**: Additional features like efficient access to the matched ranges are out-of-scope for this pitch, but will likely mean returning a nominal type from `firstMatch(of:)`. In this pitch, the result type of `firstMatch(of:)` is a tuple of `Substring`s for simplicity and brevity. Either way, the developer experience is meant to be light-weight and tuple-y. Any nominal type would likely come with dynamic member lookup for accessing captures by index (i.e. `.0`, `.1`, etc.) and name.*
116125
117-
### Capture type
126+
### Capture types
118127

119128
In this section, we describe the inferred capture types for regular expression patterns and how they compose.
120129

@@ -126,11 +135,7 @@ By default, a regular expression literal has type `Regex`. Its generic argument
126135
Capture types
127136
```
128137

129-
When there are no captures, `Match` is just the entire matched substring.
130-
131-
#### Basics
132-
133-
Regular expressions without any capturing groups have type `Regex<Substring>`, for example:
138+
When there are no captures, `Match` is just the entire matched substring, for example:
134139

135140
```swift
136141
let identifier = /[_a-zA-Z]+[_a-zA-Z0-9]*/ // => `Regex<Substring>`
@@ -142,9 +147,11 @@ let identifier = /[_a-zA-Z]+[_a-zA-Z0-9]*/ // => `Regex<Substring>`
142147
// }
143148
```
144149

150+
This falls out of Swift's normal type system rules, which treat a 1-tuple as synonymous with the element itself.
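
The 1-tuple rule can be observed with ordinary values. This is a minimal sketch that does not use `Regex`; the variable names are illustrative only.

```swift
let text = "identifier_1 = 42"

// `(Substring)` is a parenthesized type, not a distinct tuple type,
// so it is interchangeable with `Substring`.
let whole: (Substring) = text.prefix(12)
let same: Substring = whole   // no `.0` access or conversion needed

print(same)  // identifier_1
```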
151+
145152
#### Capturing group: `(...)`
146153

147-
A capturing group saves the portion of the input matched by its contained pattern. Its capture type is `Substring`.
154+
A capturing group saves the portion of the input matched by its contained pattern. The capture type of a leaf capturing group is `Substring`.
148155

149156
```swift
150157
let graphemeBreakLowerBound = /([0-9a-fA-F]+)/
@@ -156,7 +163,7 @@ let graphemeBreakLowerBound = /([0-9a-fA-F]+)/
156163

157164
#### Concatenation: `abc`
158165

159-
A concatenation's `Match` is a tuple of `Substring`s followed by every pattern's capture type. When there are no capturing groups, the `Match` is just `Substring`.
166+
A concatenation's capture types are a concatenation of the capture types of its underlying patterns, ignoring any underlying patterns with no captures.
160167

161168
```swift
162169
let graphemeBreakLowerBound = /([0-9a-fA-F]+)\.\.[0-9a-fA-F]+/
@@ -182,7 +189,7 @@ let graphemeBreakRange = /([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)/
182189

183190
#### Named capturing group: `(?<name>...)`
184191

185-
A named capturing group's capture type is `Substring`. In its `Match` type, the capture type has a tuple element label specified by the capture name.
192+
A named capturing group includes the capture's name as the label of the tuple element.
186193

187194
```swift
188195
let graphemeBreakLowerBound = /(?<lower>[0-9a-fA-F]+)\.\.[0-9a-fA-F]+/
@@ -194,7 +201,7 @@ let graphemeBreakRange = /(?<lower>[0-9a-fA-F]+)\.\.(?<upper>[0-9a-fA-F]+)/
194201

195202
#### Non-capturing group: `(?:...)`
196203

197-
A non-capturing group's capture type is the same as its underlying pattern's. That is, it does not capture anything by itself, but transparently propagates its underlying pattern's captures.
204+
A non-capturing group's capture types are the same as its underlying pattern's. That is, it does not capture anything by itself, but transparently propagates its underlying pattern's captures.
198205

199206
```swift
200207
let graphemeBreakLowerBound = /([0-9a-fA-F]+)(?:\.\.([0-9a-fA-F]+))?/
@@ -212,7 +219,7 @@ let graphemeBreakLowerBound = /([0-9a-fA-F]+)(?:\.\.([0-9a-fA-F]+))?/
212219

213220
#### Nested capturing group: `(...(...))`
214221

215-
When capturing group is nested within another capturing group, they count as two distinct captures in the order their left parenthesis first appears in the regular expression literal. This is consistent with traditional regex backreference numbering.
222+
When a capturing group is nested within another capturing group, they count as two distinct captures in the order their left parenthesis first appears in the regular expression literal. This is consistent with traditional regex backreference numbering.
216223

217224
```swift
218225
let graphemeBreakPropertyData = /(([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+)))\s*;\s(\w+).*/
@@ -246,16 +253,16 @@ let input = "007F..009F ; Control"
246253

247254
#### Quantification: `*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`
248255

249-
A quantifier wraps its underlying pattern's capture type in either an `Optional` or `Array`. Zero-or-one quantification (`?`) produces an `Optional` and all others produce an `Array`. The kind of quantification, i.e. greedy vs reluctant vs possessive, is irrelevant to determining the capture type.
256+
A quantifier wraps its underlying pattern's capture types in either `Optional`s or `Array`s. Zero-or-one quantification (`?`) produces an `Optional` and all others produce an `Array`. The kind of quantification, i.e. greedy vs reluctant vs possessive, is irrelevant to determining the capture type.
250257

251258
| Syntax | Description | Capture type |
252259
| -------------------- | --------------------- | ------------------------------------------------------------- |
253-
| `*` | 0 or more | `Array` of the sub-pattern capture type |
254-
| `+` | 1 or more | `Array` of the sub-pattern capture type |
255-
| `?` | 0 or 1 | `Optional` of the sub-pattern capture type |
256-
| `{n}` | Exactly _n_ | `Array` of the sub-pattern capture type |
257-
| `{n,m}` | Between _n_ and _m_ | `Array` of the sub-pattern capture type |
258-
| `{n,}` | _n_ or more | `Array` of the sub-pattern capture type |
260+
| `*` | 0 or more | `Array`s of the sub-pattern capture types |
261+
| `+` | 1 or more | `Array`s of the sub-pattern capture types |
262+
| `?` | 0 or 1 | `Optional`s of the sub-pattern capture types |
263+
| `{n}` | Exactly _n_ | `Array`s of the sub-pattern capture types |
264+
| `{n,m}` | Between _n_ and _m_ | `Array`s of the sub-pattern capture types |
265+
| `{n,}` | _n_ or more | `Array`s of the sub-pattern capture types |
259266

260267
```swift
261268
/([0-9a-fA-F]+)+/
@@ -360,12 +367,21 @@ if let match = "1234-5678-9abc-def0".firstMatch(of: pattern) {
360367
// Prints ["1234", "5678", "9abc", "def0"]
361368
```
362369

363-
We believe that the proposed capture behavior leads to better consistency with the meaning of these quantifiers. However, the alternative behavior does have the advantage of a smaller memory footprint because the matching algorithm would not need to allocate storage for capturing anything but the last match. As a future direction, we could introduce some way of opting into this behavior.
370+
We believe that the proposed capture behavior is more intuitive. However, the alternative behavior has a smaller memory footprint and is more consistent with usage of backreferences, which only refer to the last match of the repeated capture group:
371+
372+
```swift
373+
let pattern = /(?:([0-9a-fA-F]+)-?)+ \1/
374+
var match = "1234-5678-9abc-def0 def0".firstMatch(of: pattern)
375+
print(match != nil) // true
376+
match = "1234-5678-9abc-def0 1234".firstMatch(of: pattern)
377+
print(match != nil) // false
378+
```
379+
380+
As a future direction, we could introduce some way of opting into this behavior.
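
The backreference behavior described above can be checked today with Foundation's `NSRegularExpression`, which the Motivation section cites as an existing library. This is a sketch using that existing API rather than the pitched `firstMatch(of:)`; the helper name `matches(_:in:)` is made up for illustration.

```swift
import Foundation

// Returns whether `pattern` matches anywhere in `input`, using
// NSRegularExpression instead of the pitched `Regex` type.
func matches(_ pattern: String, in input: String) throws -> Bool {
    let regex = try NSRegularExpression(pattern: pattern, options: [])
    let range = NSRange(input.startIndex..<input.endIndex, in: input)
    return regex.firstMatch(in: input, options: [], range: range) != nil
}

let backreferencePattern = #"(?:([0-9a-fA-F]+)-?)+ \1"#

// `\1` refers to the last repetition of the capture group ("def0").
let lastRepetition = try matches(backreferencePattern, in: "1234-5678-9abc-def0 def0")
let firstRepetition = try matches(backreferencePattern, in: "1234-5678-9abc-def0 1234")
print(lastRepetition)   // true
print(firstRepetition)  // false
```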
364381

365382
#### Alternation: `a|b`
366383

367-
Alternations are used to match one of multiple patterns. An alternation wraps
368-
its underlying pattern's capture type in an `Optional`.
384+
Alternations are used to match one of multiple patterns. An alternation wraps its underlying patterns' capture types in `Optional`s and concatenates them together, first to last.
369385

370386
```swift
371387
let numberAlternationRegex = /([01]+)|[0-9]+|([0-9a-fA-F]+)/
@@ -512,27 +528,27 @@ For example, to be consistent with traditional regex backreferences quantificati
512528

513529
```swift
514530
/(?:(?<lower>[0-9a-fA-F]+)\.\.(?<upper>[0-9a-fA-F]+))+/
515-
// Flat capture type:
531+
// Flat capture types:
516532
// => `Regex<(Substring, lower: [Substring], upper: [Substring])>`
517533

518-
// Structured capture type:
534+
// Structured capture types:
519535
// => `Regex<(Substring, [(lower: Substring, upper: Substring)])>`
520536
```
521537

522-
The structured capture type is safer because the type system encodes that there are an equal number of `lower` and `upper` hex numbers. It's also more convenient because you're likely to be processing `lower` and `upper` in parallel (e.g. to create ranges).
538+
The structured capture types are safer because the type system encodes that there are an equal number of `lower` and `upper` hex numbers. It's also more convenient because you're likely to be processing `lower` and `upper` in parallel (e.g. to create ranges).
523539

524540
Similarly, alternations of multiple or nested captures produce flat optionals rather than a structured alternation type.
525541

526542
```swift
527543
/([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)|([0-9a-fA-F]+)/
528-
// Flat capture type:
544+
// Flat capture types:
529545
// => `Regex<(Substring, Substring?, Substring?, Substring?)>`
530546

531-
// Structured capture type:
547+
// Structured capture types:
532548
// => `Regex<(Substring, Alternation<((Substring, Substring), Substring)>)>`
533549
```
534550

535-
The structured capture type is safer because the type system encodes which options in the alternation of mutually exclusive. It'd also be much more convenient if, in the future, `Alternation` could behave like an enum, allowing exhaustive switching over all the options.
551+
The structured capture types are safer because the type system encodes which options in the alternation are mutually exclusive. It'd also be much more convenient if, in the future, `Alternation` could behave like an enum, allowing exhaustive switching over all the options.
536552

537553
It's possible to derive the flat type from the structured type (but not vice versa), so `Regex` could be generic over the structured type and `firstMatch(of:)` could return a result type that vends both.
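
As a rough illustration of why the derivation only goes one way, the conversion can be sketched with plain Swift values. Everything here is hypothetical: `StructuredRanges`, `FlatRanges`, `RangeOrScalar` (a stand-in for the `Alternation` type mentioned above), and both `flatten` functions are invented for this sketch and are not part of the pitch.

```swift
// Hypothetical stand-ins for the structured and flat capture types.
typealias StructuredRanges = [(lower: Substring, upper: Substring)]
typealias FlatRanges = (lower: [Substring], upper: [Substring])

// Quantification: an array of pairs becomes a pair of arrays.
func flatten(_ ranges: StructuredRanges) -> FlatRanges {
    (lower: ranges.map { $0.lower }, upper: ranges.map { $0.upper })
}

// Stand-in for `Alternation<((Substring, Substring), Substring)>`.
enum RangeOrScalar {
    case range(Substring, Substring)
    case scalar(Substring)
}

// Alternation: exactly one option matched, so the other slots become nil.
func flatten(_ option: RangeOrScalar) -> (Substring?, Substring?, Substring?) {
    switch option {
    case let .range(lower, upper): return (lower, upper, nil)
    case let .scalar(value): return (nil, nil, value)
    }
}

let structured: StructuredRanges = [
    (lower: "0000", upper: "0009"),
    (lower: "007F", upper: "009F"),
]
let flat = flatten(structured)
let flatScalar = flatten(RangeOrScalar.scalar("00AD"))
print(flat.lower, flat.upper)
print(flatScalar)
```

Going the other way would need runtime checks: the flat types also admit arrays of unequal length and optional combinations such as all `nil`, which no structured value produces.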
538554

@@ -546,9 +562,9 @@ extension String {
546562
}
547563
```
548564

549-
This is cool, but it adds extra complexity to `Regex` and it isn't as clear because the generic type no longer aligns with the traditional regex backreference numbering. Because the primary motivation for providing regex literals in Swift is their familiarity, we think the consistency of the flat capture type trumps the added safety and ergonomics of the structured captures type.
565+
This is cool, but it adds extra complexity to `Regex` and it isn't as clear because the generic type no longer aligns with the traditional regex backreference numbering. Because the primary motivation for providing regex literals in Swift is their familiarity, we think the consistency of the flat capture types trumps the added safety and ergonomics of the structured capture types.
550566

551-
We think the calculus probably flips in favor of a structured capture type for the result builder syntax, for which familiarity is not as high a priority.
567+
We think the calculus probably flips in favor of structured capture types for the result builder syntax, for which familiarity is not as high a priority.
552568

553569
## Future directions
554570

@@ -566,7 +582,7 @@ public struct DynamicCaptures: Equatable, RandomAccessCollection {
566582
subscript(position: Int) -> DynamicCaptures { get }
567583
}
568584

569-
extension Regex where Captures == DynamicCaptures {
585+
extension Regex where Match == (Substring, DynamicCaptures) {
570586
public init(_ string: String) throws
571587
}
572588
```
