Add a pitch describing matching semantic levels #64
```swift
let cafe = "Cafe\u{301}" // "Café"
cafe == "Café" // true: String equality comparison
cafe.matches(/Café/) // true: canonical equivalence
cafe.matches(/Cafe\u{301}/) // true: character recognition in regex literals
cafe.matches(/Caf./) // true: dot matches a character
cafe.matches(/Caf[åéîøü]/) // true: canonical equivalence within custom classes
cafe.matches(/Caf\w/) // true: character class matches a character
cafe.matches(/Caf\p{Letter}/) // true: Unicode property matches a character

cafe.count == 4 // true
cafe.matches(/.{4}/) // true: dot matches a character

// false: the fourth `.` matches the whole `"e\u{301}"` character
cafe.matches(/....\u{301}/)
```
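The `cafe.count == 4` line above depends on which level the string is viewed at. A minimal sketch using only standard library `String` views (no regex API is assumed here, since the `matches` spelling above is the pitch's own notation) shows how the element count changes at each level:

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT
let cafe = "Cafe\u{301}"

// Character (grapheme cluster) level: "e\u{301}" is one element.
let characterCount = cafe.count

// Unicode scalar level: the combining accent is its own element.
let scalarCount = cafe.unicodeScalars.count

// UTF-8 code unit level: U+0301 is encoded as two bytes.
let utf8Count = cafe.utf8.count

print(characterCount, scalarCount, utf8Count)
// Prints "4 5 6"
```

This is why a pattern such as `/.{4}/` can only match the whole string if `.` consumes a full `Character`.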
I think I need to break these longer examples into explanations of what's going on in more or less each line. For example, why `/Cafe\u{301}/` matches both `"Café"` and `"Cafe\u{301}"` when in `Character` semantic mode.
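The underlying behavior can be seen without any regex at all. As a rough sketch, assuming only the standard library's existing `String` and `Character` comparison rules, the precomposed and decomposed spellings are equal at the `Character` level but differ at the scalar level:

```swift
let precomposed = "Caf\u{E9}"     // "é" as the single scalar U+00E9
let decomposed = "Cafe\u{301}"    // "e" followed by U+0301

// String and Character comparison use canonical equivalence.
let equalAsStrings = precomposed == decomposed
let equalAsCharacters = precomposed.last == decomposed.last

// Comparing scalar by scalar sees two different sequences.
let equalAsScalars = precomposed.unicodeScalars
    .elementsEqual(decomposed.unicodeScalars)

print(equalAsStrings, equalAsCharacters, equalAsScalars)
// Prints "true true false"
```

A regex that compares `Character` by `Character` would inherit the first two results, which is the behavior the examples above describe.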
Very quick read and a brusque (sorry) response. I feel like the character properties pitch opened some questions that this doesn't answer, such as the meaning of a custom character class composed of combining scalars. If that's punted to another pitch, let's be explicit about it.
Err, I feel like this kinda sorta argues directly against the pitch. If these are weird or niche, why are we spilling so much ink and so much angst over it, if it also means the ASCII-only use case will be second class or non-default? And the immediate following question is, if you're doing a "sophisticated task", why are you using regular expressions?
How is this even possible given anything we've pitched or proposed so far?
Cue the horror of anyone concerned with internationalization. What is the definition of
Are the default semantics of quantification the same? What binary data processing use cases, other than as a seek-like query over small amounts of data, do you think a regular expression is a good idea for? It seems like the vast majority of the time, unrestricted backtracking is a total hazard, and working around this accumulates undue complexity.
And what if the diacritics are in the character class instead?
I'm not sure what this sentence is saying. Isn't this at odds with the statement that a string literal will match itself? What about scalar values inside string literals?
It took an unreasonably long time to realize you're trying to find a null-terminated C string. Actually, all of these examples take me an unreasonably long time to guess what they're doing and I'm unsure I got it right or would be able to maintain any code base that relied on these. Why is this the syntax? Is this a nice Swifty syntax or traditional regex syntax? Throughout the pitch I see "Swift regular expression syntax", but how Swifty is that? If we're pulling in another language's syntax for pragmatic reasons, then let's be explicit about it.
What about in grapheme semantic mode? Is that a dip down into scalar processing? Doesn't that violate the whole
Justification? The rest of this pitch has rationale, but I'm wondering why this is the first and foremost stake to put in the ground, especially lacking any rationale. A "single type" can be achieved by making an This "single type" is many types under a single wrapper. But, below we argue for ad-hoc conventions within those parameters, so we're not actually parametrically polymorphic, if I'm understanding the pitch correctly. That is, we're smuggling in an ad-hoc encoding of multiple types under a single umbrella by pretending to be parametric. This might be the right call in the end, but it shouldn't be left as an exercise for the reader to infer what's really going on.
How does this compare/contrast to other regex engines? Is this their spelling?
Is this just a single capture, or all captures in a regex?
This gets confusing pretty quickly. How about a short section that helps establish some basic reasoning? Regexes are programs (or formulas in a logic) to execute over some abstract model of "string". We have 3.5 such models of "string", here's how we change between models/levels. This should be fairly familiar to readers and generic collection algorithms are analogous. They are also algorithms/programs to run over an abstract model of something holding a bunch of elements. It's clear why you'd get different answers running
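To illustrate the analogy with generic collection algorithms, here is a small sketch. The helper `occurrences(of:in:)` is hypothetical and written only for this example; it is not part of the pitch or the standard library. The same algorithm gives different answers depending on which view of the string it runs over:

```swift
// Hypothetical helper: counts the elements equal to `element`.
func occurrences<C: Collection>(of element: C.Element, in collection: C) -> Int
where C.Element: Equatable {
    collection.filter { $0 == element }.count
}

let cafe = "Cafe\u{301}"

// Character level: the last element is "e\u{301}", which is not equal to "e".
let asCharacters = occurrences(of: "e", in: cafe)

// Unicode scalar level: "e" and U+0301 are separate elements.
let asScalars = occurrences(of: "e", in: cafe.unicodeScalars)

// UTF-8 level: the byte 0x65 appears once.
let asUTF8 = occurrences(of: UInt8(ascii: "e"), in: cafe.utf8)

print(asCharacters, asScalars, asUTF8)
// Prints "0 1 1"
```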
Note that both approaches could accumulate and encode the levels in the type statically. Wouldn't the type of captures potentially have to change from this operation? Are we accumulating any more static info or just dropping it on the floor?
And so does
@milseman Thanks much for the feedback! I hope I've addressed your comments in this revision, especially in regards to behavior of character classes in the different semantic modes. A few specific notes below:
All the syntax in the regular expression examples is the familiarly confounding regular expression syntax found everywhere else. We aren't proposing significant changes to the structure of regular expressions, as that just becomes another overly terse micro-language for users to learn. The builder DSL is the right alternative to the regex syntax, as it allows the full richness of regular Swift code instead of whatever small additions we might want to bake in.
If we apply a regular expression to a general collection of `UInt8`, we're starting without any guarantees of encoding validity. In addition, even in a string, any capture group that doesn't have scalar-aligned endpoints is going to be potentially invalid UTF-8:
```swift
let str = "My favorite emoji is 🥰"
let captured = str.utf8.dropLast()
// captured is a Substring.UTF8View, and includes malformed UTF-8 at the end
print(String(decoding: captured, as: UTF8.self))
// Prints "My favorite emoji is \u{fffd}"
```
This bit is still a little bit TBD, in particular how we would propagate the default semantic mode if it were changed after creating the regular expression.
The most important question, IMO, for this pitch to answer is where the grapheme boundaries are. Can you specify a reasoning process for where grapheme boundaries are inferred to be? If you're punting this, it should be stated explicitly.
Why? What's the rationale? If we're going to deviate from the stdlib's general philosophy of Unicode, we should at least mention why and how far this new precedent extends.
Why? Rationale?
Changes `Regex` and result builder prototype to use `Match` as the generic parameter to make it consistent with the [Strongly Typed Regex Captures](https://forums.swift.org/t/pitch-strongly-typed-regex-captures/53391) pitch. Introduces `Tuple<n>` structs in order to be able to express constraints on capture types (i.e. `Match` dropped first) while being able to filter out empty captures in concatenation. `Tuple<n>` is also needed to implement a prototype of the [proposed matching semantics](swiftlang#64). As coercion into `Tuple<n>` can no longer use runtime magic like native tuples do, we incorporate child capture type information into RECode's `captureNil` and `captureArray` instructions so that we will always get a concrete type when forming a nil or an empty array capture. The resulting existential tuple capture can then be opened and bitcast to a `Tuple<n>`.
@swift-ci Please test macOS platform
Ping: what's the status of this PR?
In-progress pitch describing the three different semantic levels for regular expressions, how they match differently, and their effect on capture types.