Skip to content

Escape sequence + empty char class tweaks #226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 31, 2022

Conversation

hamishknight
Copy link
Contributor

  • Start throwing an error on unknown a-zA-Z and non-ASCII non-whitespace escape sequences.
  • Start allowing more escape sequences that denote Unicode scalars in custom character class ranges.
  • Forbid empty character classes, ] as the first character is instead treated as literal, for consistency with PCRE, Oniguruma, and ICU.

@hamishknight hamishknight requested a review from milseman March 21, 2022 19:36
Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -641,17 +641,67 @@ extension AST.Atom {
case .scalar(let s):
return Character(s)

case .escaped(let c):
switch c {
// TODO: Should we separate these into a separate enum? Or move the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but feel free to look first at the use sites of literalCharacterValue to see if there's really two different questions an AST client is asking. There might be a literal value for the purposes of character class treatment vs a literal value for the purposes of performing the match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

literalCharacterValue is currently only really used for building a model character class, and not for regular matching. It seems like we might want to use these scalars for both though, I actually wonder whether we should convert these escape sequences into a .scalar DSL atoms? Because ultimately it seems like these are just syntactic sugar for unicode scalar sequences, and it would allow them to be supported by the matching engine. Any thoughts?

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Throw an error for unknown a-z escape sequences
as well as non-ASCII non-whitespace escape
sequences.
Certain escape sequences express a unicode scalar
and as such are valid in a range.
This is now done from the DSLTree.
As per PCRE, Oniguruma, and ICU, a first character
of `]` is treated as literal.
@milseman
Copy link
Member

literalCharacterValue is currently only really used for building a model character class, and not for regular matching. It seems like we might want to use these scalars for both though, I actually wonder whether we should convert these escape sequences into a .scalar DSL atoms? Because ultimately it seems like these are just syntactic sugar for unicode scalar sequences, and it would allow them to be supported by the matching engine. Any thoughts?

Argh, won't let me just reply in-line. I think converting anything that is literally matching a scalar value to a .scalar DSL atom makes sense to me. We'd want to make sure we're not making assumptions about options and character classes though, CC @natecook1000

@hamishknight
Copy link
Contributor Author

@swift-ci please test

@hamishknight
Copy link
Contributor Author

Going to land this so I can land more stuff on top of it, I can work on converting to scalars in a follow-up

@hamishknight hamishknight merged commit 9889ae7 into swiftlang:main Mar 31, 2022
@hamishknight hamishknight deleted the esc branch March 31, 2022 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants