Escape sequence + empty char class tweaks #226

hamishknight · 2022-03-21T19:34:51Z

Start throwing an error on unknown a-zA-Z escape sequences and on non-ASCII, non-whitespace escape sequences.
Start allowing more escape sequences that denote Unicode scalars in custom character class ranges.
Forbid empty character classes: a `]` as the first character is instead treated as a literal, for consistency with PCRE, Oniguruma, and ICU.
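
The first and third rules above can be illustrated with a minimal sketch. This is not the parser's actual code: `EscapeCheck`, `knownLetters`, `classifyEscape`, and `literalMembers` are hypothetical names, the table of known letters is an illustrative subset, and the class scanner ignores escapes, ranges, negation, and nesting.

```swift
/// Hypothetical result of checking the character that follows a backslash.
enum EscapeCheck: Equatable {
  case known    // a recognized escape such as `\n` or `\d`
  case literal  // the character escapes to itself, e.g. `\.`
  case invalid  // unknown a-zA-Z, or non-ASCII non-whitespace
}

// Illustrative subset only; the real parser recognizes many more letters.
let knownLetters: Set<Character> = ["a", "e", "f", "n", "r", "t", "d", "s", "w", "b"]
let digits: ClosedRange<Character> = "0"..."9"

func classifyEscape(_ c: Character) -> EscapeCheck {
  if knownLetters.contains(c) { return .known }
  // Digits begin backreferences or octal sequences; treated as known here.
  if digits.contains(c) { return .known }
  if c.isASCII {
    // Unknown ASCII letters are rejected; other ASCII characters are literal.
    return c.isLetter ? .invalid : .literal
  }
  // Outside ASCII, only whitespace may be escaped.
  return c.isWhitespace ? .literal : .invalid
}

/// Collects the members of a simple custom character class such as "[]a]".
/// Returns nil if the input does not start with `[` or is never closed.
func literalMembers(ofClass pattern: String) -> [Character]? {
  let chars = Array(pattern)
  guard chars.first == "[" else { return nil }
  var i = 1
  var members: [Character] = []
  // A `]` directly after `[` is a literal member, not the terminator,
  // so a class can never be empty.
  if i < chars.count, chars[i] == "]" {
    members.append("]")
    i += 1
  }
  while i < chars.count {
    if chars[i] == "]" { return members }
    members.append(chars[i])
    i += 1
  }
  return nil  // unterminated class
}
```

Under this sketch, `[]a]` yields the members `]` and `a`, while `[]` is reported as unterminated rather than accepted as an empty class.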

milseman

LGTM

milseman · 2022-03-28T17:19:52Z

Sources/_MatchingEngine/Regex/AST/Atom.swift

@@ -641,17 +641,67 @@ extension AST.Atom {
    case .scalar(let s):
      return Character(s)

+    case .escaped(let c):
+      switch c {
+      // TODO: Should we separate these into a separate enum? Or move the


LGTM, but feel free to look first at the use sites of literalCharacterValue to see whether an AST client is really asking two different questions. There might be one literal value for the purposes of character class treatment and another for the purposes of performing the match.

hamishknight · 2022-03-29

literalCharacterValue is currently only used for building a model character class, not for regular matching. It seems like we might want to use these scalars for both, though. I wonder whether we should convert these escape sequences into .scalar DSL atoms. Ultimately they seem to be just syntactic sugar for Unicode scalar sequences, and converting them would allow the matching engine to support them. Any thoughts?
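
As a rough illustration of the idea of treating such escapes as plain scalars, the sketch below maps a few escapes to the scalar each one denotes and uses that to form a character class range. `Escaped`, `scalarValue`, and `scalarRange` are hypothetical names rather than the library's API; the scalar values themselves are the standard ones for these escapes.

```swift
/// Hypothetical stand-in for the parser's escaped builtins.
enum Escaped {
  case alarm, backspace, escape, formfeed, newline, carriageReturn, tab
  case wordCharacter  // `\w`: a class, not a single scalar
}

extension Escaped {
  /// The single Unicode scalar the escape denotes, or nil if it denotes
  /// a set of characters instead.
  var scalarValue: Unicode.Scalar? {
    switch self {
    case .alarm:          return "\u{7}"
    case .backspace:      return "\u{8}"
    case .escape:         return "\u{1B}"
    case .formfeed:       return "\u{C}"
    case .newline:        return "\n"
    case .carriageReturn: return "\r"
    case .tab:            return "\t"
    case .wordCharacter:  return nil
    }
  }
}

/// Builds the scalar range for a class such as `[\t-\r]`.
/// Returns nil if either end is not a single scalar or the bounds are inverted.
func scalarRange(_ lhs: Escaped, _ rhs: Escaped) -> ClosedRange<Unicode.Scalar>? {
  guard let lo = lhs.scalarValue, let hi = rhs.scalarValue, lo <= hi else {
    return nil
  }
  return lo...hi
}
```

With this mapping, `scalarRange(.tab, .carriageReturn)` covers U+0009 through U+000D and therefore contains the newline scalar, whereas a range with `.wordCharacter` as an endpoint is rejected.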

milseman · 2022-03-30T15:55:44Z

> literalCharacterValue is currently only used for building a model character class, not for regular matching. It seems like we might want to use these scalars for both, though. I wonder whether we should convert these escape sequences into .scalar DSL atoms. Ultimately they seem to be just syntactic sugar for Unicode scalar sequences, and converting them would allow the matching engine to support them. Any thoughts?

Argh, it won't let me just reply in-line. Converting anything that literally matches a scalar value to a .scalar DSL atom makes sense to me. We'd want to make sure we're not making assumptions about options and character classes, though. CC @natecook1000

hamishknight · 2022-03-31T16:20:22Z

@swift-ci please test

hamishknight · 2022-03-31T16:23:55Z

Going to land this so I can land more stuff on top of it. I can work on converting to scalars in a follow-up.

hamishknight requested a review from milseman March 21, 2022 19:36

milseman approved these changes Mar 28, 2022

hamishknight added 4 commits March 30, 2022 12:11

Error on unknown escape sequences

d3bd6ad

Throw an error for unknown a-z escape sequences as well as non-ASCII non-whitespace escape sequences.

Allow certain escape sequences in character class ranges

5a52d53

Certain escape sequences express a unicode scalar and as such are valid in a range.

Remove obsolete CharacterClass model computation

692f0fd

This is now done from the DSLTree.

Forbid empty character classes

cdf98c5

As per PCRE, Oniguruma, and ICU, a first character of `]` is treated as literal.

hamishknight force-pushed the esc branch from 7cc4556 to cdf98c5 March 30, 2022 11:12

hamishknight merged commit 9889ae7 into swiftlang:main Mar 31, 2022

hamishknight deleted the esc branch March 31, 2022 16:24

hamishknight mentioned this pull request Apr 4, 2022

Convert scalar escape sequences to DSL scalars #245

Merged
