revised algorithm 5 7 #2

Closed
wants to merge 47 commits into from

Conversation

itingliu
Owner

rxwei and others added 30 commits March 21, 2022 20:19
* Add string processing algorithms pitch

Co-authored-by: Tim Vermeulen <tvermeulen@apple.com>
Co-authored-by: Michael Ilseman <michael.ilseman@gmail.com>
Move the regex builder DSL (except `RegexComponent`) to a new module named RegexBuilder. The DSL depends on `DSLTree` and a few other supporting types, so those types have been made `_spi(RegexBuilder) public`. The SPI establishes an ABI between `_StringProcessing` and `RegexBuilder`, but I don't think it's a concern because the two modules will co-evolve and both will be rebuilt for every release.
This adds methods to RegexComponent for the remainder of the regex
options, and passes the current MatchingOptions further down into
the consumers so that the correct behavior can be used.
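
As a rough illustration of what such option methods look like in use, here is a minimal sketch. It assumes the `ignoresCase()` and `dotMatchesNewlines()` methods as they shipped on `Regex` in Swift 5.7; the spelling at the time of this commit may have differed.

```swift
// Assumes Swift 5.7 or later, where Regex ships with the standard library.
let greeting = try! Regex("hello")

// Without options, matching is case-sensitive, so this is nil.
let strictMatch = try! greeting.wholeMatch(in: "HELLO")

// ignoresCase() returns a new regex that matches case-insensitively.
let relaxedMatch = try! greeting.ignoresCase().wholeMatch(in: "HELLO")

// By default `.` does not match a newline; dotMatchesNewlines() changes that.
let anyChar = try! Regex("a.b")
let plainDot = try! anyChar.wholeMatch(in: "a\nb")
let dotAll = try! anyChar.dotMatchesNewlines().wholeMatch(in: "a\nb")
```

Each option method returns a modified copy of the regex, so the original `greeting` and `anyChar` values keep their default behavior.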
…g#247)

* Clean up based on the String Processing Algorithms proposal

- Move functions and types that have not been proposed from public to internal
- Add doc comments for public API
- Add FIXME for API awaiting SE-0346
- Replace `_MatchResult` with `Regex<Output>.Match` and update tests

Co-authored-by: Richard Wei <rxrwei@gmail.com>
Makes the existing CharacterClass model type SPI, and adds a public
CharacterClass type to the RegexBuilder module, which uses a DSLTree char class
instead of the AST's version. RegexBuilder.CharacterClass is a more limited
API than we need for the internal character class model, giving us room to
expand on it as necessary in the future.
These were mostly leftover bits from testing 👋
* Update regex syntax pitch

* Rename file
* Throwing customization hooks

* Adds test to try out throwing custom code. 

* Adds processor support.

* Add throws to capture transform API, plumbing

* Remove non-failable try-capture overloads
Go from matchWhole -> wholeMatch(in:), which is more consistent with firstMatch etc.
Add availability to public, SPI, and test symbols.
Add remaining availability annotations.
`firstMatch(of:)` was ignoring the start/endIndex when searching in
substrings; this change fixes that issue. Also adds the 'in' label
to `Regex.firstMatch(in:Substring)` to match the rest of the related
APIs.
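
A small sketch of the behavior described above, assuming the `wholeMatch(in:)` and `firstMatch(in:)` methods as they appear on `Regex` in Swift 5.7. The input string and variable names are made up for illustration.

```swift
// Assumes Swift 5.7 or later, where Regex ships with the standard library.
let record = "id=42; id=7"
let number = try! Regex("[0-9]+")

// wholeMatch(in:) succeeds only when the regex covers the entire input.
let exact = try! number.wholeMatch(in: "42")
let trailing = try! number.wholeMatch(in: "42a")

// Slice off everything before offset 6, leaving " id=7".
let tail = record[record.index(record.startIndex, offsetBy: 6)...]

// firstMatch(in:) on a Substring should search only within the slice's
// bounds, so it finds "7" rather than the "42" that precedes the slice.
let found = try! number.firstMatch(in: tail)
let foundText = found.map { String(record[$0.range]) }
```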
Throwing matches and update to CustomMatchingRegexComponent
Import _RegexParser as implementation only
Remove compiling argument label
[5.7] Integrate API changes into release/5.7
* Expose `matches`, `ranges` and `split`

Publicize these APIs per the String Processing Algorithms proposal. The proposed
ones return generic `Collection`, empowered by SE-0346. For now we'll wrap the
results with a concrete `Array` until the language feature is ready.
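
For orientation, a minimal sketch of the three algorithms, assuming the regex-taking `matches(of:)`, `ranges(of:)` and `split(separator:)` methods as they shipped in Swift 5.7, each returning an `Array` as described above. The sample string is invented.

```swift
// Assumes Swift 5.7 or later, where Regex ships with the standard library.
let mixed = "a1b22c333"
let digitRun = try! Regex("[0-9]+")

// matches(of:) returns every non-overlapping match, in order.
let matchedText = mixed.matches(of: digitRun).map { String(mixed[$0.range]) }

// ranges(of:) returns only the ranges of those matches.
let matchedRanges = mixed.ranges(of: digitRun)

// split(separator:) treats each match as a separator; empty pieces are
// omitted by default, so the trailing "333" leaves no empty element.
let pieces = mixed.split(separator: digitRun).map { String($0) }
```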

Co-authored-by: Michael Ilseman <michael.ilseman@gmail.com>
This isn't actually used, as we convert to a DSL
custom character class, and then use that consumer
logic.
hamishknight and others added 17 commits April 21, 2022 18:03
Convert AST escape sequences that represent a
scalar value (e.g `\f`, `\n`, `\a`) into scalars in
the DSL tree. This allows the matching engine to
match against them.
ICU and Oniguruma allow custom character classes to
begin with `:`, and only lex a POSIX character
property if they detect a closing `:]`. However
their behavior for this differs:

- ICU will consider *any* `:]` in the regex as a
closing delimiter, even e.g `[[:a]][:]`.

- Oniguruma will stop if it hits a `]`, so
`[[:a]][:]` is treated as a custom character class.
However it only scans ahead 20 chars max, and doesn't
stop for e.g a nested character class opening `[`.

Our detection behavior for this is as follows:

- When `[:` is encountered inside a custom character
class, scan ahead to the closing `:]`.
- While scanning, bail if we see any characters
that are obviously invalid property names. Currently
this includes `[`, `]`, `}`, as well as a second
occurrence of `=`.
- Otherwise, if we end on `:]`, consider that a
POSIX character property.

We could include more metacharacters to bail on,
e.g `{`, `(`, `)`, but for now I'm erring on the
side of lexing an invalid POSIX character property.
We can always relax this in the future (as we'd be
turning invalid code into valid code). Users can
always escape the initial `:` in `[:` if they want
a custom character class. In fact, we may want to
suggest this via a warning, as this behavior can
be pretty subtle.
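
The scan-ahead rule above can be sketched roughly as follows. This is not the parser's actual code: `scanPOSIXProperty` is a hypothetical helper that takes the text following an opening `[:` and applies only the bail-out characters listed in this commit message.

```swift
/// Hypothetical helper: `rest` is the text that follows an opening `[:`.
/// Returns the candidate property text if a closing `:]` is reached before
/// any obviously invalid character, or nil to signal "lex as a custom
/// character class instead".
func scanPOSIXProperty(_ rest: Substring) -> Substring? {
    var sawEquals = false
    var index = rest.startIndex
    while index < rest.endIndex {
        let character = rest[index]
        let next = rest.index(after: index)
        // Ending on `:]` means we treat the span as a POSIX property.
        if character == ":", next < rest.endIndex, rest[next] == "]" {
            return rest[rest.startIndex..<index]
        }
        switch character {
        case "[", "]", "}":
            return nil  // obviously invalid in a property name
        case "=":
            if sawEquals { return nil }  // a second `=` is invalid
            sawEquals = true
        default:
            break
        }
        index = next
    }
    return nil  // ran out of input without seeing `:]`
}
```

With this sketch, `[[:a]][:]` is not treated as a POSIX property, because the scan hits `]` before any `:]`, while `[[:alpha:]]` yields the property text `alpha`.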
This matches the ICU behavior, and appears to be
suggested by UTS#18.
Rather than matching and not advancing the input,
we should always return `nil` to never match
against the trivia.
Previously we would check for an empty array of
members when deciding whether an initial `]` is
literal, or if the operands of a set operation are
invalid. Switch to checking whether we have any
semantic members instead.
This should be unreachable, let's make sure of
that. Doing so requires generalizing the handling
of LocatedError a bit.
Previously we would always parse a
"change matching option" sequence as a group, and
for the isolated syntax e.g `(?x)`, we would have
it implicitly wrap everything after in the same
group by having it do a recursive parse.

This matched the Oniguruma behavior for such
isolated groups, and was easy to implement, but
its behavior is quite unintuitive when it comes
to alternations, as e.g `a(?x)b|c` becomes
`a(?x:b|c)`, which may not be expected.

Instead, let's follow PCRE's behavior by having
such isolated cases affect the syntax options for
the remainder of the current group, including
across alternation branches. This is done by
lexing such cases as atoms (as they aren't really
group-like anymore), and having them change the
syntax options when we encounter them. The existing
scoping rules around groups take care of resetting
the options when we exit the scope.
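
To make the scoping difference concrete, here is a small sketch using the case-insensitive option `(?i)` rather than `(?x)`, on the assumption that isolated matching options follow the same PCRE-style scoping at match time in Swift 5.7. The pattern and inputs are invented for illustration.

```swift
// Assumes Swift 5.7 or later, where Regex ships with the standard library.
// Under PCRE-style scoping, `(?i)` applies to the rest of the enclosing
// group, including the `c` branch, but not to the `a` that precedes it.
let scoped = try! Regex("a(?i)b|c")

// `a` is still case-sensitive; `b` after `(?i)` is not.
let mixedCase = try! scoped.wholeMatch(in: "aB")

// The second branch is also case-insensitive, so "C" alone matches.
// Under the old `a(?i:b|c)` interpretation it would need a leading `a`.
let secondBranch = try! scoped.wholeMatch(in: "C")

// An uppercase "A" comes before the option takes effect, so no match.
let upperA = try! scoped.wholeMatch(in: "AB")
```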
Previously we would form an `.other` character
property kind for any unclassified properties,
which would crash at runtime as unsupported. Instead,
switch to erroring on them. Eventually it would be
nice if we could version this based on what the
runtime being targeted supports.
Add backslash to the list of characters we don't
consider valid for a character property name. This
means that we'll bail when attempting to lex a
POSIX character property and instead lex a custom
character class. This allows e.g `[:\Q :] \E]` to
be lexed as a custom character class. For `\p{...}`
this just means we'll emit a truncated invalid
property error, which is arguably more in line with
what the user was expecting.

I noticed when digging through the ICU source code
that it will bail out of parsing a POSIX character
property if it encounters one of its known escape
sequences (e.g `\a`, `\e`, `\f`, ...). Interestingly
this doesn't cover character property escapes e.g
`\d`, but it's not clear that is intentional. Given
backslash is not a valid character property character
anyway, it seems reasonable to broaden this behavior
to bail on any backslash.
This provides a RegexBuilder API that represents the same thing as `\O`
in regex syntax.
This also moves QuantificationBehavior from the RegexBuilder module
down to _StringProcessing, and renames it to RegexRepetitionBehavior.
This will go back when 182da3b is
merged into the 5.7 branch.
[5.7] Expose `matches`, `ranges` and `split`
[5.7] Update API for congruence with Unicode proposal
* Rename custom match prefix protocol and add doc comments

* Update algo proposal prose
@itingliu itingliu closed this Apr 22, 2022