[5.7] Cherry-pick parser changes to 5.7 #309

hamishknight · 2022-04-21T10:35:50Z

Cherry-pick of:

Most notably, this allows scalar escape sequences such as \n to be matched against, fixes non-semantic whitespace in custom character classes, corrects matching option scoping behavior, and starts rejecting invalid character properties at compile time.

The AST `CustomCharacterClass` consumer generation isn't actually used, as we convert to a DSL custom character class, and then use that consumer logic.

Convert AST escape sequences that represent a scalar value (e.g `\f`, `\n`, `\a`) into scalars in the DSL tree. This allows the matching engine to match against them.
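As a rough illustration of what "represent a scalar value" means here, the sketch below maps the letter after a backslash to the Unicode scalar it denotes. The function name `scalar(forEscape:)` is hypothetical and is not the parser's API; the set of escapes shown is the conventional one and may not match the full list the parser handles.

```swift
// Hypothetical helper, for illustration only: maps the letter following a
// backslash to the Unicode scalar that the escape denotes.
func scalar(forEscape letter: Character) -> Unicode.Scalar? {
  switch letter {
  case "a": return "\u{07}" // bell
  case "e": return "\u{1B}" // escape
  case "f": return "\u{0C}" // form feed
  case "n": return "\u{0A}" // line feed
  case "r": return "\u{0D}" // carriage return
  case "t": return "\u{09}" // horizontal tab
  default:  return nil      // not a scalar escape (e.g. `\d` is a character class)
  }
}
```

Once an escape has been reduced to a scalar like this, it can be matched the same way as any literal scalar in the pattern.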

ICU and Oniguruma allow custom character classes to begin with `:`, and only lex a POSIX character property if they detect a closing `:]`. However, their behavior for this differs:

- ICU will consider *any* `:]` in the regex as a closing delimiter, even e.g `[[:a]][:]`.
- Oniguruma will stop if it hits a `]`, so `[[:a]][:]` is treated as a custom character class. However it only scans ahead 20 chars max, and doesn't stop for e.g a nested character class opening `[`.

Our detection behavior for this is as follows:

- When `[:` is encountered inside a custom character class, scan ahead to the closing `:]`.
- While scanning, bail if we see any characters that are obviously invalid property names. Currently this includes `[`, `]`, `}`, as well as a second occurrence of `=`.
- Otherwise, if we end on `:]`, consider that a POSIX character property.

We could include more metacharacters to bail on, e.g `{`, `(`, `)`, but for now I'm erring on the side of lexing an invalid POSIX character property. We can always relax this in the future (as we'd be turning invalid code into valid code). Users can always escape the initial `:` in `[:` if they want a custom character class. In fact, we may want to suggest this via a warning, as this behavior can be pretty subtle.
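The detection rules described above can be sketched roughly as follows. This is a simplified model, not the lexer's actual code: the function name `looksLikePOSIXProperty` and its `String` parameter are assumptions made for illustration, and the input is taken to be whatever follows the opening `[:`.

```swift
// Hypothetical sketch of the lookahead; names are illustrative only.
// `rest` is the input immediately after an opening `[:`.
func looksLikePOSIXProperty(_ rest: String) -> Bool {
  var seenEquals = false
  var previous: Character? = nil
  for ch in rest {
    // A `:` immediately followed by `]` closes the property.
    if previous == ":" && ch == "]" { return true }
    switch ch {
    case "[", "]", "}":
      return false                   // obviously not part of a property name
    case "=":
      if seenEquals { return false } // a second `=` is invalid
      seenEquals = true
    default:
      break
    }
    previous = ch
  }
  return false                       // ran out of input without seeing `:]`
}
```

With this model, the text after `[:` in `[[:alpha:]]` is accepted as a property, while the text after the first `[:` in `[[:a]][:]` is rejected at the first `]`, so that input falls back to a custom character class.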

Allowing POSIX character properties outside of custom character classes matches the ICU behavior, and appears to be suggested by UTS#18.

When matching character class trivia, rather than matching and not advancing the input, we should always return `nil` to never match against the trivia.

Previously we would check for an empty array of members when deciding whether an initial `]` is literal, or if the operands of a set operation are invalid. Switch to checking whether we have any semantic members instead.
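A minimal sketch of the difference between the two checks, assuming a much-simplified member type. `Member`, `isSemantic`, and `initialBracketIsLiteral(after:)` are hypothetical names for illustration; the real AST has more member kinds.

```swift
// Hypothetical, simplified member type; not the real AST.
enum Member {
  case atom(Character)
  case trivia(String)   // non-semantic whitespace or a comment

  var isSemantic: Bool {
    if case .trivia = self { return false }
    return true
  }
}

// An initial `]` is literal only if no semantic member precedes it.
// Checking `members.isEmpty` instead would be fooled by leading trivia.
func initialBracketIsLiteral(after members: [Member]) -> Bool {
  !members.contains(where: { $0.isSemantic })
}
```

Under this model, a class whose only member so far is whitespace trivia still treats a following `]` as literal, which an `isEmpty` check would get wrong.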

A stray opening `(` should be unreachable in the parser; let's make sure of that by throwing an error if we encounter one. Doing so requires generalizing the handling of LocatedError a bit.

Previously we would always parse a "change matching option" sequence as a group, and for the isolated syntax e.g `(?x)`, we would have it implicitly wrap everything after in the same group by having it do a recursive parse. This matched the Oniguruma behavior for such isolated groups, and was easy to implement, but its behavior is quite unintuitive when it comes to alternations, as e.g `a(?x)b|c` becomes `a(?x:b|c)`, which may not be expected. Instead, let's follow PCRE's behavior by having such isolated cases affect the syntax options for the remainder of the current group, including across alternation branches. This is done by lexing such cases as atoms (as they aren't really group-like anymore), and having them change the syntax options when we encounter them. The existing scoping rules around groups take care of resetting the options when we exit the scope.
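The scoping rule can be modeled with a small stack of per-group options. The sketch below is a toy, not the real lexer or parser: the `Token` cases and `extendedFlags(for:)` are hypothetical, and only a single boolean option (extended syntax) is tracked.

```swift
// Hypothetical token model, only to illustrate scoping.
enum Token {
  case open                 // start of a group
  case close                // end of a group
  case alternation          // `|`
  case enableExtended       // an isolated `(?x)`, lexed as an atom
  case literal(Character)
}

// Returns, for each literal, whether extended syntax was active at that point.
func extendedFlags(for tokens: [Token]) -> [Bool] {
  var scopes = [false]      // one entry per open group; top level starts off
  var result: [Bool] = []
  for token in tokens {
    switch token {
    case .open:
      scopes.append(scopes[scopes.count - 1])     // inherit on entry
    case .close:
      if scopes.count > 1 { scopes.removeLast() } // restore on exit
    case .alternation:
      break                                       // options carry across branches
    case .enableExtended:
      scopes[scopes.count - 1] = true             // rest of the current group
    case .literal:
      result.append(scopes[scopes.count - 1])
    }
  }
  return result
}
```

For a pattern shaped like `a(b(?x)c|d)e`, this model reports the option as active for `c` and `d` only: it applies to the remainder of the enclosing group, including the second alternation branch, and is reset when the group closes.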

Previously we would form an `.other` character property kind for any unclassified properties, which would crash at runtime as unsupported. Instead, switch to erroring on them. Eventually it would be nice if we could version this based on what the runtime being targeted supports.

Add backslash to the list of characters we don't consider valid for a character property name. This means that we'll bail when attempting to lex a POSIX character property and instead lex a custom character class. This allows e.g `[:\Q :] \E]` to be lexed as a custom character class. For `\p{...}` this just means we'll emit a truncated invalid property error, which is arguably more in line with what the user was expecting.

I noticed when digging through the ICU source code that it will bail out of parsing a POSIX character property if it encounters one of its known escape sequences (e.g `\a`, `\e`, `\f`, ...). Interestingly this doesn't cover character property escapes e.g `\d`, but it's not clear that is intentional. Given backslash is not a valid character property character anyway, it seems reasonable to broaden this behavior to bail on any backslash.

hamishknight · 2022-04-21T17:05:27Z

@swift-ci please test

milseman

LGTM

hamishknight mentioned this pull request Apr 21, 2022

[5.7] [DNM] Null PR swiftlang/swift#42532

Closed

hamishknight requested review from milseman and stephentyrone April 21, 2022 10:38

hamishknight added 11 commits April 21, 2022 18:03

Fix HexDigit definition in RegexSyntax.md

3f2832d

Remove AST CustomCharacterClass consumer generation

3cce15d

This isn't actually used, as we convert to a DSL custom character class, and then use that consumer logic.

Convert scalar escape sequences to DSL scalars

577dc6e

Convert AST escape sequences that represent a scalar value (e.g `\f`, `\n`, `\a`) into scalars in the DSL tree. This allows the matching engine to match against them.

Allow POSIX character properties outside of custom character classes

5912ab4

This matches the ICU behavior, and appears to be suggested by UTS#18.

Fix character class trivia matching

c638486

Rather than matching and not advancing the input, we should always return `nil` to never match against the trivia.

Fix trivia parsing for set operations and initial ] cases

f053dc3

Previously we would check for an empty array of members when deciding whether an initial `]` is literal, or if the operands of a set operation are invalid. Switch to checking whether we have any semantic members instead.

Throw error if we encounter stray opening '('

e84c93d

This should be unreachable, let's make sure of that. Doing so requires generalizing the handling of LocatedError a bit.

hamishknight force-pushed the parser-changes-5.7 branch from 81f1350 to 771e735 Compare April 21, 2022 17:04

stephentyrone requested a review from airspeedswift April 21, 2022 18:10

stephentyrone approved these changes Apr 21, 2022

milseman approved these changes Apr 22, 2022

airspeedswift approved these changes Apr 22, 2022

hamishknight merged commit 29bc5da into swiftlang:swift/release/5.7 Apr 22, 2022

hamishknight deleted the parser-changes-5.7 branch April 22, 2022 13:43
