Skip to content

Allow custom character classes to begin with : #269

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 12, 2022

Conversation

hamishknight
Copy link
Contributor

ICU and Oniguruma allow custom character classes to begin with :, and only lex a POSIX character property if they detect a closing :]. However their behavior for this differs:

  • ICU will consider any :] in the regex as a closing delimiter, even e.g [[:a]][:].
  • Oniguruma will stop if it hits a ], so [[:a]][:] is treated as a custom character class. However it only scans ahead 20 chars max, and doesn't stop for e.g a nested character class opening [.

Our detection behavior for this is as follows:

  • When [: is encountered inside a custom character class, scan ahead to the closing :].
  • While scanning, bail if we see any characters that are obviously invalid property names. Currently this includes [, ], }, as well as a second occurrence of =.
  • Otherwise, if we end on :], consider that a POSIX character property.

We could include more metacharacters to bail on, e.g {, (, ), but for now I'm tending on the side of lexing an invalid POSIX character property. We can always relax this in the future (as we'd be turning invalid code into valid code). Users can always escape the initial : in [: if they want a custom character class. In fact, we may want to suggest this via a warning, as this behavior can be pretty subtle.

ICU and Oniguruma allow custom character classes to
begin with `:`, and only lex a POSIX character
property if they detect a closing `:]`. However
their behavior for this differs:

- ICU will consider *any* `:]` in the regex as a
closing delimiter, even e.g `[[:a]][:]`.

- Oniguruma will stop if it hits a `]`, so
`[[:a]][:]` is treated as a custom character class.
However it only scans ahead 20 chars max, and doesn't
stop for e.g a nested character class opening `[`.

Our detection behavior for this is as follows:

- When `[:` is encountered inside a custom character
class, scan ahead to the closing `:]`.
- While scanning, bail if we see any characters
that are obviously invalid property names. Currently
this includes `[`, `]`, `}`, as well as a second
occurrence of `=`.
- Otherwise, if we end on `:]`, consider that a
POSIX character property.

We could include more metacharacters to bail on,
e.g `{`, `(`, `)`, but for now I'm tending on the
side of lexing an invalid POSIX character property.
We can always relax this in the future (as we'd be
turning invalid code into valid code). Users can
always escape the initial `:` in `[:` if they want
a custom character class. In fact, we may want to
suggest this via a warning, as this behavior can
be pretty subtle.
@hamishknight
Copy link
Contributor Author

@swift-ci please test

@hamishknight hamishknight requested a review from milseman April 12, 2022 16:48
Comment on lines +1119 to +1125
// TODO: Start erroring on these?
parseTest(#"\p{Lx}"#, prop(.other(key: nil, value: "Lx")))
parseTest(#"\p{gcL}"#, prop(.other(key: nil, value: "gcL")))
parseTest(#"\p{x=y}"#, prop(.other(key: "x", value: "y")))
parseTest(#"\p{aaa(b)}"#, prop(.other(key: nil, value: "aaa(b)")))
parseTest("[[:a():]]", charClass(posixProp_m(.other(key: nil, value: "a()"))))
parseTest(#"\p{aaa\p{b}}"#, concat(prop(.other(key: nil, value: #"aaa\p{b"#)), "}"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milseman Am I right in saying we'll want to start erroring on these unknown cases?

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants