revised algorithm 5 7 #2

itingliu · 2022-04-22T19:15:38Z

Pitch: String processing algorithms (swiftlang/swift-experimental-string-processing#188)
RegexBuilder module
Fill out remainder of options API (swiftlang/swift-experimental-string-processing#246)
Clean up based on the String Processing Algorithms proposal (swiftlang/swift-experimental-string-processing#247)
Move CharacterClass API into RegexBuilder (swiftlang/swift-experimental-string-processing#254)
Eliminate extra public API (swiftlang/swift-experimental-string-processing#256)
Update regex syntax pitch (swiftlang/swift-experimental-string-processing#258)
Throwing customization hooks (swiftlang/swift-experimental-string-processing#261)
Nominalize API names (swiftlang/swift-experimental-string-processing#271)
Add SwiftStdlib 5.7 availability (swiftlang/swift-experimental-string-processing#276)
Rename RegexComponent.Output (swiftlang/swift-experimental-string-processing#281)
Move RegexComponent conformances to RegexBuilder (swiftlang/swift-experimental-string-processing#279)
Merge pull request swiftlang/swift-experimental-string-processing#283 from rxwei/fix-availability
Add Substring algorithms tests (swiftlang/swift-experimental-string-processing#289)
Merge pull request swiftlang/swift-experimental-string-processing#273 from itingliu/throwing-hooks
RegexBuilder quantifiers take an optional behavior (swiftlang/swift-experimental-string-processing#293: Make RegexBuilder quantifiers follow option settings)
Nominalize option methods (swiftlang/swift-experimental-string-processing#295)
Merge pull request swiftlang/swift-experimental-string-processing#287 from apple/impl-import
Remove compiling argument label
Expose matches, ranges and split (swiftlang/swift-experimental-string-processing#304)
Fix HexDigit definition in RegexSyntax.md
Remove AST CustomCharacterClass consumer generation
Convert scalar escape sequences to DSL scalars
Allow custom character classes to begin with :
Allow POSIX character properties outside of custom character classes
Fix character class trivia matching
Fix trivia parsing for set operations and initial ] cases
Throw error if we encounter stray opening '('
Change matching option scoping behavior to match PCRE
Error on unknown character properties
Don't parse a character property containing a backslash
Adds RegexBuilder.CharacterClass.anyUnicodeScalar (swiftlang/swift-experimental-string-processing#315)
Allow setting any of the three quant behaviors (swiftlang/swift-experimental-string-processing#311)
Fixup for missing AST import separation
Updates for algorithms proposal (swiftlang/swift-experimental-string-processing#319)
Rename CustomPrefixMatchRC to CustomConsumingRegexComponent

Move the regex builder DSL (except `RegexComponent`) to a new module named RegexBuilder. The DSL depends on `DSLTree` and a few other supporting types, so those types have been made `_spi(RegexBuilder) public`. The SPI establishes an ABI between `_StringProcessing` and `RegexBuilder`, but I don't think it's a concern because the two modules will co-evolve and both will be rebuilt for every release.

…g#247)

* Clean up based on the String Processing Algorithms proposal

- Move functions and types that have not been proposed from public to internal
- Add doc comments for public API
- Add FIXME for API awaiting SE-0346
- Replace `_MatchResult` with `Regex<Output>.Match` and update tests

Co-authored-by: Richard Wei <rxrwei@gmail.com>

Makes the existing CharacterClass model type SPI, and adds a public CharacterClass type to the RegexBuilder module, which uses a DSLTree char class instead of the AST's version. RegexBuilder.CharacterClass is a more limited API than we need for the internal character class model, giving us room to expand on it as necessary in the future.

* Expose `matches`, `ranges` and `split`

Publicize these APIs per the String Processing Algorithms proposal. The proposed ones return generic `Collection`, empowered by SE-0346. For now we'll wrap the results with a concrete `Array` until the language feature is ready.

Co-authored-by: Michael Ilseman <michael.ilseman@gmail.com>
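A minimal sketch of how the three exposed algorithms might be called. It assumes a Swift 5.7 or later toolchain and the API names as they later shipped; the sample string and patterns are illustrative only.

```swift
// Assumption: Swift 5.7+, where Regex and these algorithms are available.
let csv = "a1,b22;c333"
let number = try Regex("[0-9]+")
let separator = try Regex("[,;]")

// matches(of:) yields one match per occurrence; read the text via its range.
let found = csv.matches(of: number).map { String(csv[$0.range]) }

// ranges(of:) yields only the ranges; convert to integer offsets for display.
let offsets = csv.ranges(of: number).map {
    csv.distance(from: csv.startIndex, to: $0.lowerBound)
}

// split(separator:) accepts a regex as the separator.
let fields = csv.split(separator: separator).map { String($0) }

print(found)    // ["1", "22", "333"]
print(offsets)  // [1, 4, 8]
print(fields)   // ["a1", "b22", "c333"]
```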

ICU and Oniguruma allow custom character classes to begin with `:`, and only lex a POSIX character property if they detect a closing `:]`. However, their behavior for this differs:

- ICU will consider *any* `:]` in the regex as a closing delimiter, even e.g. `[[:a]][:]`.
- Oniguruma will stop if it hits a `]`, so `[[:a]][:]` is treated as a custom character class. However, it only scans ahead 20 chars max, and doesn't stop for e.g. a nested character class opening `[`.

Our detection behavior for this is as follows:

- When `[:` is encountered inside a custom character class, scan ahead to the closing `:]`.
- While scanning, bail if we see any characters that are obviously invalid property names. Currently this includes `[`, `]`, `}`, as well as a second occurrence of `=`.
- Otherwise, if we end on `:]`, consider that a POSIX character property.

We could include more metacharacters to bail on, e.g. `{`, `(`, `)`, but for now I'm erring on the side of lexing an invalid POSIX character property. We can always relax this in the future (as we'd be turning invalid code into valid code). Users can always escape the initial `:` in `[:` if they want a custom character class. In fact, we may want to suggest this via a warning, as this behavior can be pretty subtle.
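The scan-ahead rule described above can be sketched roughly as follows. This is not the parser's actual code: `looksLikePOSIXProperty` is a hypothetical helper that receives the text following `[:` and applies only the bail-out rules listed in this message.

```swift
/// Hypothetical helper (not the real lexer): given the text that follows `[:`,
/// decide whether it should be lexed as a POSIX character property.
func looksLikePOSIXProperty(_ rest: Substring) -> Bool {
    var seenEquals = false
    var index = rest.startIndex
    while index < rest.endIndex {
        let char = rest[index]
        let next = rest.index(after: index)
        // A closing `:]` ends the scan successfully.
        if char == ":", next < rest.endIndex, rest[next] == "]" {
            return true
        }
        switch char {
        case "[", "]", "}":
            return false                    // obviously invalid in a property name
        case "=":
            if seenEquals { return false }  // a second `=` is also invalid
            seenEquals = true
        default:
            break
        }
        index = next
    }
    return false  // ran out of input without finding `:]`
}

print(looksLikePOSIXProperty("alpha:]"))  // true
print(looksLikePOSIXProperty("a]][:]"))   // false: remainder of `[[:a]][:]`
```

Because the scan stops at the first `]`, the `[[:a]][:]` case from the ICU comparison falls back to a custom character class under this sketch.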

Previously we would always parse a "change matching option" sequence as a group, and for the isolated syntax e.g. `(?x)`, we would have it implicitly wrap everything after in the same group by having it do a recursive parse. This matched the Oniguruma behavior for such isolated groups, and was easy to implement, but its behavior is quite unintuitive when it comes to alternations, as e.g. `a(?x)b|c` becomes `a(?x:b|c)`, which may not be expected.

Instead, let's follow PCRE's behavior by having such isolated cases affect the syntax options for the remainder of the current group, including across alternation branches. This is done by lexing such cases as atoms (as they aren't really group-like anymore), and having them change the syntax options when we encounter them. The existing scoping rules around groups take care of resetting the options when we exit the scope.

Previously we would form an `.other` character property kind for any unclassified properties, which would crash at runtime as unsupported. Instead, switch to erroring on them. Eventually it would be nice if we could version this based on what the runtime being targeted supports.

Add backslash to the list of characters we don't consider valid for a character property name. This means that we'll bail when attempting to lex a POSIX character property and instead lex a custom character class. This allows e.g. `[:\Q :] \E]` to be lexed as a custom character class. For `\p{...}` this just means we'll emit a truncated invalid property error, which is arguably more in line with what the user was expecting.

I noticed when digging through the ICU source code that it will bail out of parsing a POSIX character property if it encounters one of its known escape sequences (e.g. `\a`, `\e`, `\f`, ...). Interestingly this doesn't cover character property escapes e.g. `\d`, but it's not clear that is intentional. Given backslash is not a valid character property character anyway, it seems reasonable to broaden this behavior to bail on any backslash.

rxwei and others added 30 commits March 21, 2022 20:19

Merge pull request swiftlang#225 from rxwei/main-integration-50ec05d

93a894e

Pitch: String processing algorithms (swiftlang#188)

45ab195

* Add string processing algorithms pitch

Co-authored-by: Tim Vermeulen <tvermeulen@apple.com>
Co-authored-by: Michael Ilseman <michael.ilseman@gmail.com>

Merge pull request swiftlang#231 from rxwei/main-integration-bb1f34a

79066a8

[Integration] main (bb1f34a) -> swift/main

Merge branch 'main' of github.com:apple/swift-experimental-string-pro…

9e330ba

…cessing into main-integration-d2ff78f6

Merge pull request swiftlang#235 from rxwei/main-integration-d2ff78f6

044be96

[Integration] main (d2ff78f) -> swift/main

Merge branch 'main' into main-merge

a989eae

Merge pull request swiftlang#244 from hamishknight/main-merge

b583909

Fill out remainder of options API (swiftlang#246)

1a96ea8

This adds methods to RegexComponent for the remainder of the regex options, and passes the current MatchingOptions further down into the consumers so that the correct behavior can be used.
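As a rough illustration of one such option method, the sketch below applies `ignoresCase()`. It assumes a Swift 5.7 or later toolchain and the method name as it later shipped on `Regex`, which may differ from the name at this commit.

```swift
// Assumption: Swift 5.7+, using the shipped `ignoresCase()` spelling.
let greeting = try Regex("hello world")

// Matching is case-sensitive by default.
let strict = try greeting.wholeMatch(in: "Hello World")

// The option method returns a new regex with the option applied.
let relaxed = try greeting.ignoresCase().wholeMatch(in: "Hello World")

print(strict == nil)   // true
print(relaxed != nil)  // true
```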

Eliminate extra public API (swiftlang#256)

cc91315

These were mostly leftover bits from testing 👋

Update regex syntax pitch (swiftlang#258)

b86ca70

* Update regex syntax pitch
* Rename file

Throwing customization hooks (swiftlang#261)

e2e3d63

* Throwing customization hooks
* Adds test to try out throwing custom code.
* Adds processor support.
* Add throws to capture transform API, plumbing
* Remove non-failable try-capture overloads

Nominalize API names (swiftlang#271)

57d8db7

Rename `matchWhole` to `wholeMatch(in:)`, which is more consistent with `firstMatch` and the related APIs.
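A short sketch of how the renamed method sits next to `firstMatch(in:)`, assuming a Swift 5.7 or later toolchain and the names as they later shipped. The input string is made up for illustration.

```swift
// Assumption: Swift 5.7+, using the shipped method names.
let digits = try Regex("[0-9]+")
let input = "order 66 shipped"

// wholeMatch(in:) succeeds only if the entire input matches.
let whole = try digits.wholeMatch(in: input)
let wholeOnDigits = try digits.wholeMatch(in: "66")

// firstMatch(in:) finds the first occurrence anywhere in the input.
let first = try digits.firstMatch(in: input)
let firstText = first.map { String(input[$0.range]) }

print(whole == nil)          // true
print(wholeOnDigits != nil)  // true
print(firstText ?? "none")   // 66
```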

Add SwiftStdlib 5.7 availability (swiftlang#276)

3f63265

Add availability to public, SPI, and test symbols.

Rename RegexComponent.Output (swiftlang#281)

f144abc

Move RegexComponent conformances to RegexBuilder (swiftlang#279)

315c418

Merge pull request swiftlang#283 from rxwei/fix-availability

ba032b6

Add remaining availability annotations.

Add Substring algorithms tests (swiftlang#289)

fde4c58

`firstMatch(of:)` was ignoring the start/endIndex when searching in substrings; this change fixes that issue. Also adds the 'in' label to `Regex.firstMatch(in:Substring)` to match the rest of the related APIs.
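The fixed behavior can be checked with a slice that excludes digits on both sides. This is a sketch assuming a Swift 5.7 or later toolchain and the shipped API names; the sample text is illustrative.

```swift
// Assumption: Swift 5.7+, using the shipped method names.
let text = "id=1; id=2; id=3"
let digit = try Regex("[0-9]")

// The slice " id=2; id=" leaves out the leading "1" and the trailing "3".
let slice = text.dropFirst(5).dropLast(1)

// Both spellings should search only within the slice's bounds.
let viaCollection = slice.firstMatch(of: digit).map { String(text[$0.range]) }
let viaRegex = try digit.firstMatch(in: slice).map { String(text[$0.range]) }

print(viaCollection ?? "none")  // 2
print(viaRegex ?? "none")       // 2
```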

Merge pull request swiftlang#273 from itingliu/throwing-hooks

d002466

Throwing matches and update to CustomMatchingRegexComponent

RegexBuilder quantifiers take an optional behavior (swiftlang#293)

3c43286

Nominalize option methods (swiftlang#295)

0d41bb2

Merge pull request swiftlang#287 from apple/impl-import

3cd65cd

Import _RegexParser as implementation only

Remove compiling argument label

51756fb

Merge pull request #1 from milseman/5_7_azoy

2d9de48

Remove compiling argument label

Merge pull request swiftlang#298 from Azoy/da-api-mon

115a937

[5.7] Integrate API changes into release/5.7

Fix HexDigit definition in RegexSyntax.md

3f2832d

Remove AST CustomCharacterClass consumer generation

3cce15d

This isn't actually used, as we convert to a DSL custom character class, and then use that consumer logic.

hamishknight and others added 17 commits April 21, 2022 18:03

Convert scalar escape sequences to DSL scalars

577dc6e

Convert AST escape sequences that represent a scalar value (e.g. `\f`, `\n`, `\a`) into scalars in the DSL tree. This allows the matching engine to match against them.

Allow POSIX character properties outside of custom character classes

5912ab4

This matches the ICU behavior, and appears to be suggested by UTS#18.

Fix character class trivia matching

c638486

Rather than matching and not advancing the input, we should always return `nil` to never match against the trivia.

Fix trivia parsing for set operations and initial ] cases

f053dc3

Previously we would check for an empty array of members when deciding whether an initial `]` is literal, or if the operands of a set operation are invalid. Switch to checking whether we have any semantic members instead.

Throw error if we encounter stray opening '('

e84c93d

This should be unreachable, let's make sure of that. Doing so requires generalizing the handling of LocatedError a bit.

Adds RegexBuilder.CharacterClass.anyUnicodeScalar (swiftlang#315)

b8a1a81

This provides a RegexBuilder API that represents the same as `\O` in regex syntax.

Allow setting any of the three quant behaviors (swiftlang#311)

82fcf4a

This also moves QuantificationBehavior from the RegexBuilder module down to _StringProcessing, and renames it to RegexRepetitionBehavior.
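A minimal sketch of passing a repetition behavior to a RegexBuilder quantifier, assuming a Swift 5.7 or later toolchain and the shipped `RegexRepetitionBehavior` cases. Only `.reluctant` is shown alongside the default; `.possessive` is the third behavior.

```swift
import RegexBuilder

// Assumption: Swift 5.7+, using the shipped RegexBuilder API.
let tags = "<a><b>"

// With no explicit behavior, the quantifier here matches as much as it can.
let eager = Regex {
    "<"
    OneOrMore(.any)
    ">"
}

// `.reluctant` makes the quantifier match as little as possible.
let reluctant = Regex {
    "<"
    OneOrMore(.any, .reluctant)
    ">"
}

let eagerText = tags.firstMatch(of: eager).map { String(tags[$0.range]) }
let reluctantText = tags.firstMatch(of: reluctant).map { String(tags[$0.range]) }

print(eagerText ?? "none")      // <a><b>
print(reluctantText ?? "none")  // <a>
```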

Fixup for missing AST import separation

eba0393

This will go back when 182da3b is merged into the 5.7 branch.

Merge pull request swiftlang#313 from Azoy/matches-ranges-split

dad77c5

[5.7] Expose `matches`, `ranges` and `split`

Merge pull request swiftlang#316 from natecook1000/unicode_api_5.7

fc46753

[5.7] Update API for congruence with Unicode proposal

Merge pull request swiftlang#309 from hamishknight/parser-changes-5.7

29bc5da

Updates for algorithms proposal (swiftlang#319)

2745de2

* Rename custom match prefix protocol and add doc comments
* Update algo proposal prose

Rename CustomPrefixMatchRC to CustomConsumingRegexComponent

978cce1

itingliu closed this Apr 22, 2022
