Add a pitch describing matching semantic levels #64

natecook1000 · 2021-12-10T19:22:33Z

In-progress pitch describing the three different semantic levels for regular expressions, how they match differently, and their effect on capture types.

natecook1000 · 2021-12-10T19:24:24Z

Documentation/Evolution/MatchingSemantics.md

+```swift
+let cafe = "Cafe\u{301}"      // "Café"
+cafe == "Café"                // true: String equality comparison
+cafe.matches(/Café/)          // true: canonical equivalence
+cafe.matches(/Cafe\u{301}/)   // true: character recognition in regex literals
+cafe.matches(/Caf./)          // true: dot matches a character
+cafe.matches(/Caf[åéîøü]/)    // true: canonical equivalence within custom classes
+cafe.matches(/Caf\w/)         // true: character class matches a character
+cafe.matches(/Caf\p{Letter}/) // true: Unicode property matches a character
+
+cafe.count == 4               // true
+cafe.matches(/.{4}/)          // true: dot matches a character
+
+// false: the fourth `.` matches the whole `"e\u{301}"` character
+cafe.matches(/....\u{301}/)
+```


I think I need to break these longer examples into explanations of what's going on in more or less each line. For example, why /Cafe\u{301}/ matches both "Café" and "Cafe\u{301}" when in Character semantic mode.
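For reference, the equivalence in question already exists in `String` comparison today. A minimal sketch using only standard library calls (it does not use any regex API from the pitch, and the constant names are illustrative):

```swift
// Two spellings of the same text: decomposed (e + U+0301) and precomposed (U+00E9).
let decomposed = "Cafe\u{301}"
let precomposed = "Caf\u{E9}"

// String equality uses canonical equivalence, so the two spellings compare equal.
let equalAsStrings = decomposed == precomposed

// Comparing the underlying Unicode scalars directly does not apply that equivalence.
let equalAsScalars = decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars)

// Both spellings contain four Characters, but different numbers of scalars.
let characterCounts = (decomposed.count, precomposed.count)
let scalarCounts = (decomposed.unicodeScalars.count, precomposed.unicodeScalars.count)
```

Here `equalAsStrings` is true while `equalAsScalars` is false. If Character semantic mode follows `String` comparison, as the diff comments suggest, that would explain why either spelling in the literal matches either spelling in the input.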

Documentation/Evolution/MatchingSemantics.md

milseman · 2021-12-10T22:28:04Z

Very quick read and a brusque (sorry) response. I feel like the character properties pitch opened some questions that this doesn't answer, such as the meaning of a custom character class comprised of combining scalars. If that's punted to another pitch, let's be explicit about it.

Many string processing tasks, like the example in the overview of processing a Unicode data
table, operate on ASCII-only input data. It should be simple and easy to use Swift regular
expressions to match and extract captures from such data.

For more sophisticated tasks ...

Err, I feel like this kinda sorta argues directly against the pitch. If these are weird or niche, why are we spilling so much ink and so much angst over it, if it also means the ASCII-only use case will be second class or non-default? And the immediate following question is, if you're doing a "sophisticated task", why are you using regular expressions?

like capturing a Substring over a range of bytes that isn't valid UTF-8.

How is this even possible given anything we've pitched or proposed so far?

An example of such a task would be eliminating duplicate words

Cue the horror of anyone concerned with internationalization. What is the definition of \b? What are you really trying to do?

Binary data processing

Are the default semantics of quantification the same? What binary data processing use cases, other than as a seek-like query over small amounts of data, do you think a regular expression is a good idea for? It seems like the vast majority of the time, unrestricted backtracking is a total hazard, and working around this accumulates undue complexity.

cafe.contains(/Caf[åéîòü]/) // true

And what if the diacritics are in the character class instead?

Regular expressions can match strings with a specific number of characters, but don't delve into the constitutive parts of a character to match individual Unicode scalars.

I'm not sure what this sentence is saying. Isn't this at odds with the statement that a string literal will match itself? What about scalar values inside string literals?

let bsonInt32Regex = /\x10(?u:(?:.+\x00))(?b:(?:.{4}))/

It took an unreasonably long time to realize you're trying to find a null-terminated C string. Actually, all of these examples take me an unreasonably long time to guess what they're doing and I'm unsure I got it right or would be able to maintain any code base that relied on these.

Why is this the syntax? Is this a nice Swifty syntax or traditional regex syntax? Throughout the pitch I see "Swift regular expression syntax", but how Swifty is that? If we're pulling in another language's syntax for pragmatic reasons, then let's be explicit about it.
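For readers trying to decode the regex above: a BSON int32 element is a `0x10` type byte, a null-terminated element name, and four little-endian value bytes. The following is a rough hand-written equivalent, not code from the pitch; `parseInt32Element` is a hypothetical name, and unlike the regex's `.+` it accepts an empty element name.

```swift
// Hypothetical helper (not part of the pitch): parse one BSON int32 element
// from the start of a byte array.
func parseInt32Element(_ bytes: [UInt8]) -> (name: String, value: Int32)? {
    // The type tag 0x10 marks a 32-bit integer element.
    guard bytes.first == 0x10 else { return nil }
    // The element name is a C string: every byte up to the first 0x00.
    guard let terminator = bytes[1...].firstIndex(of: 0x00) else { return nil }
    let name = String(decoding: bytes[1..<terminator], as: UTF8.self)
    // Four value bytes must follow the terminator.
    let valueStart = terminator + 1
    guard bytes.count >= valueStart + 4 else { return nil }
    // Assemble the little-endian bytes into a 32-bit pattern.
    var raw: UInt32 = 0
    for (offset, byte) in bytes[valueStart..<valueStart + 4].enumerated() {
        raw |= UInt32(byte) << UInt32(8 * offset)
    }
    return (name, Int32(bitPattern: raw))
}
```

The name portion is where string semantics apply and the value portion is where raw bytes apply, which seems to be what the `(?u:…)` and `(?b:…)` groups in the regex are meant to express.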

Similarly, a Unicode property metacharacter such as \p{Letter} always matches at the Unicode scalar value level, even if the regular expression is using byte semantics at the time.

What about in grapheme semantic mode? Is that a dip down into scalar processing? Doesn't that violate the whole idea that . represents a single unit? That is, if you replace \p{Letter} inside an otherwise grapheme-semantic literal with ., does this completely change how the string gets processed? I would normally expect \p{Letter} to call Character.isLetter by default.
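To make the two candidate levels concrete, here is a small sketch using existing standard library properties only. It shows what `Character.isLetter` and the per-scalar general category report for a decomposed "é"; it says nothing about how the pitch would actually implement `\p{Letter}`.

```swift
// A single Character made of two scalars: "e" followed by U+0301 COMBINING ACUTE ACCENT.
let accented: Character = "e\u{301}"
let baseScalar: Unicode.Scalar = "e"
let markScalar: Unicode.Scalar = "\u{301}"

// Character-level query: the whole grapheme cluster is reported as a letter.
let characterIsLetter = accented.isLetter

// Scalar-level queries: only the base scalar is in a Letter category;
// the combining accent is a nonspacing mark.
let baseCategory = baseScalar.properties.generalCategory
let markCategory = markScalar.properties.generalCategory
```

A scalar-level `\p{Letter}` would therefore stop after the base "e" and leave the accent unmatched, whereas a `Character.isLetter`-based check would consume the whole cluster.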

The standard library should include a single Regex type

Justification? The rest of this pitch has rationale, but I'm wondering why this is the first and foremost stake to put in the ground, especially lacking any rationale.

A "single type" can be achieved by making an enum of arbitrarily many types, but that doesn't mean you're providing benefit. That has to be weighed against how many times you have to qualify your statements, à la "if it's applied to a binary collection, then it does this...".

This "single type" is many types under a single wrapper. But, below we argue for ad-hoc conventions within those parameters, so we're not actually parametrically polymorphic, if I'm understanding the pitch correctly. That is, we're smuggling in an ad-hoc encoding of multiple types under a single umbrella by pretending to be parametric.

This might be the right call in the end, but it shouldn't be left as an exercise for the reader to infer what's really going on.

Enabling and disabling these modes have the behavior of pushing and popping from a stack. The compiler will warn when a mode has been ended (e.g. (?-b)) before it has been enabled.

How does this compare/contrast to other regex engines? Is this their spelling?
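One way to picture the push/pop behavior described in the quoted passage is a plain stack of semantic levels. This is only a sketch under assumptions: `SemanticLevel` and `SemanticModeStack` are hypothetical names, and the pitch does not specify how the compiler tracks modes or what the default level is.

```swift
// Hypothetical model of the three semantic levels discussed in the pitch.
enum SemanticLevel { case character, unicodeScalar, byte }

// Hypothetical tracker: enabling a mode pushes it, disabling pops it.
struct SemanticModeStack {
    private(set) var levels: [SemanticLevel]

    // The initial level is assumed to be Character semantics.
    init(initial: SemanticLevel = .character) { levels = [initial] }

    // The innermost enabled level; the initial level is never popped.
    var current: SemanticLevel { levels[levels.count - 1] }

    // Corresponds to something like (?b) or (?u).
    mutating func enable(_ level: SemanticLevel) { levels.append(level) }

    // Corresponds to something like (?-b). Returns false when the mode was not
    // the innermost enabled one, which is where a warning would be emitted.
    mutating func disable(_ level: SemanticLevel) -> Bool {
        guard levels.count > 1, current == level else { return false }
        levels.removeLast()
        return true
    }
}
```

Under this model, `(?-b)` with no preceding `(?b)` leaves the stack untouched and reports the problem, which matches the warning described above.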

However, if a capture includes any portion of a regular expression with byte semantics, or if a regular expression is applied to a non-string collection, then the capture group is captured as a collection that does not provide a UTF-8 correctness guarantee

Is this just a single capture, or all captures in a regex?

UnicodeScalar-based semantics mean that matching with a regular expression operates directly on the Unicode scalar values that comprise a string, matching many other regular expression engines. At this semantic level, a "dot" or character class matches a single Unicode scalar value, and canonical equivalence is not used when comparing

This gets confusing pretty quickly. How about a short section that helps establish some basic reasoning? Regexes are programs (or formulas in a logic) to execute over some abstract model of "string". We have 3.5 such models of "string", here's how we change between models/levels.

This should be fairly familiar to readers and generic collection algorithms are analogous. They are also algorithms/programs to run over an abstract model of something holding a bunch of elements. It's clear why you'd get different answers running string.unicodeScalars.forEach(print) than string.utf8.forEach(print).
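The collection analogy can be seen directly with the existing string views. A brief sketch using only standard library APIs (the constant names are illustrative):

```swift
// The same string seen through three element models.
let word = "Cafe\u{301}"

// Character view: four grapheme clusters, the last one being "e" + U+0301.
let characterCount = word.count

// Unicode scalar view: five scalar values.
let scalarValues = word.unicodeScalars.map { $0.value }

// UTF-8 view: six code units, because U+0301 encodes as the two bytes 0xCC 0x81.
let utf8Bytes = Array(word.utf8)
```

Any algorithm that says "match one element", including a regex `.`, naturally gives a different answer depending on which of these views it runs over.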

You can turn on UnicodeScalar semantics by using the u flag in a regular expression literal or the unicodeScalarSemantics property on a regular expression instance.

Note that both approaches could accumulate and encode the levels in the type statically. Wouldn't the type of captures potentially have to change from this operation? Are we accumulating any more static info or just dropping it on the floor?

Substring includes a guarantee that its contents are valid UTF-8.

And so does Slice<String.UTF8View>. The whole point is that Substring indices are scalar aligned while sub-scalar views might not be.
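The alignment point can be checked with existing index APIs. In this sketch (standard library only, illustrative names), a UTF-8 view index that falls inside a multi-byte scalar has no corresponding position in the scalar view:

```swift
// "é" (U+00E9) occupies two UTF-8 code units, followed by a one-byte "!".
let text = "\u{E9}!"
let utf8View = text.utf8

// An index pointing at the continuation byte of "é": valid in the UTF-8 view,
// but not aligned to a Unicode scalar boundary.
let insideScalar = utf8View.index(after: utf8View.startIndex)

// The next index is the start of "!", which is scalar-aligned.
let alignedIndex = utf8View.index(after: insideScalar)

let insideHasScalarPosition = insideScalar.samePosition(in: text.unicodeScalars) != nil
let alignedHasScalarPosition = alignedIndex.samePosition(in: text.unicodeScalars) != nil

// Decoding a slice that ends mid-scalar yields a replacement character.
let truncated = String(decoding: utf8View[..<insideScalar], as: UTF8.self)
```

Here `insideHasScalarPosition` is false and `alignedHasScalarPosition` is true, so a capture whose endpoints come from byte-level matching cannot always be represented with scalar-aligned indices.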

Documentation/Evolution/MatchingSemantics.md

natecook1000 · 2021-12-17T16:38:31Z

@milseman Thanks much for the feedback! I hope I've addressed your comments in this revision, especially with regard to the behavior of character classes in the different semantic modes. A few specific notes below:

It took an unreasonably long time to realize you're trying to find a null-terminated C string. Actually, all of these examples take me an unreasonably long time to guess what they're doing and I'm unsure I got it right or would be able to maintain any code base that relied on these.

Why is this the syntax? Is this a nice Swifty syntax or traditional regex syntax? Throughout the pitch I see "Swift regular expression syntax", but how Swifty is that? If we're pulling in another language's syntax for pragmatic reasons, then let's be explicit about it.

All the syntax in the regular expression examples is the familiarly confounding regular expression syntax found everywhere else. We aren't proposing significant changes to the structure of regular expressions, as that just becomes another overly terse micro-language for users to learn. The builder DSL is the right alternative to the regex syntax, as it allows the full richness of regular Swift code instead of whatever small additions we might want to bake in.

like capturing a Substring over a range of bytes that isn't valid UTF-8.
How is this even possible given anything we've pitched or proposed so far?

Substring includes a guarantee that its contents are valid UTF-8.
And so does Slice<String.UTF8View>. The whole point is that Substring indices are scalar aligned while sub-scalar views might not be.

If we apply a regular expression to a general collection of UInt8, we're starting without any guarantees of encoding validity. In addition, even in a string, any capture group that doesn't have scalar-aligned endpoints is going to be potentially invalid UTF-8:

let str = "My favorite emoji is 🥰"
let captured = str.utf8.dropLast()
// captured is a Slice<String.UTF8View>, and includes malformed UTF-8 at the end
print(String(decoding: captured, as: UTF8.self))
// Prints "My favorite emoji is \u{fffd}"

You can turn on UnicodeScalar semantics by using the u flag in a regular expression literal or the unicodeScalarSemantics property on a regular expression instance.

Note that both approaches could accumulate and encode the levels in the type statically. Wouldn't the type of captures potentially have to change from this operation? Are we accumulating any more static info or just dropping it on the floor?

This bit is still a little bit TBD, in particular how we would propagate the default semantic mode if it were changed after creating the regular expression.

milseman · 2021-12-20T16:21:48Z

The most important question, IMO, for this pitch to answer is where the grapheme boundaries are. Can you specify a reasoning process for where grapheme boundaries are inferred to be?

If you're punting this, it should be stated explicitly.

For a grapheme cluster to match a character class, all Unicode scalars that constitute the grapheme cluster must belong to the requisite Unicode categories.

Why? What's the rationale? If we're going to deviate from the stdlib's general philosophy of Unicode, we should at least mention why and how far this new precedent extends.
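To illustrate how the quoted rule differs from existing behavior, here is one possible reading of it as code. `allScalarsAreLetters` is a hypothetical helper written for this comparison, not an API from the pitch or the standard library.

```swift
// Hypothetical reading of the quoted rule for a Letter class: every scalar in
// the grapheme cluster must itself be in a Letter general category.
func allScalarsAreLetters(_ character: Character) -> Bool {
    character.unicodeScalars.allSatisfy { scalar in
        switch scalar.properties.generalCategory {
        case .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
             .modifierLetter, .otherLetter:
            return true
        default:
            return false
        }
    }
}

// Canonically equivalent spellings of "é".
let precomposedE: Character = "\u{E9}"
let decomposedE: Character = "e\u{301}"
```

Under this reading the precomposed form passes and the decomposed form fails, because U+0301 is a nonspacing mark, even though the two values compare equal as `Character` and both report `isLetter` as true.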

Grapheme recognition does not apply across custom character class definitions, so attempting to match the individual parts of a decomposed character does not succeed in character semantic mode.

Why? Rationale?

Changes `Regex` and result builder prototype to use `Match` as the generic parameter to make it consistent with the [Strongly Typed Regex Captures](https://forums.swift.org/t/pitch-strongly-typed-regex-captures/53391) pitch. Introduces `Tuple<n>` structs in order to be able to express constraints on capture types (i.e. `Match` dropped first) while being able to filter out empty captures in concatenation. `Tuple<n>` is also needed to implement a prototype of the [proposed matching semantics](swiftlang#64). As coercion into `Tuple<n>` can no longer use runtime magic like native tuples do, we incorporate child capture type information into RECode's `captureNil` and `captureArray` instructions so that we will always get a concrete type when forming a nil or an empty array capture. The resulting existential tuple capture can then be opened and bitcast to a `Tuple<n>`.

natecook1000 · 2022-02-01T20:00:43Z

@swift-ci Please test macOS platform

natecook1000 · 2022-02-01T22:30:53Z

@swift-ci Please test macOS platform

milseman · 2022-02-28T18:31:25Z

ping what's the status of this PR?

Add a pitch describing matching semantic levels

51e51e7

natecook1000 marked this pull request as draft December 10, 2021 19:22

rxwei reviewed Dec 10, 2021

kylemacomber reviewed Dec 11, 2021

Kyle Macomber and others added 2 commits December 10, 2021 16:39

Fix strongly type regex captures link

9da476e

Revisions to the semantics pitch

22f9b03

hamishknight mentioned this pull request Dec 20, 2021

Parse matching options #91

Merged

natecook1000 closed this May 11, 2022
