Skip to content

Commit 81f1350

Browse files
committed
Don't parse a character property containing a backslash
Add backslash to the list of characters we don't consider valid for a character property name. This means that we'll bail when attempting to lex a POSIX character property and instead lex a custom character class. This allows e.g `[:\Q :] \E]` to be lexed as a custom character class. For `\p{...}` this just means we'll emit a truncated invalid property error, which is arguably more inline with what the user was expecting.. I noticed when digging through the ICU source code that it will bail out of parsing a POSIX character property if it encounters one of its known escape sequences (e.g `\a`, `\e`, `\f`, ...). Interestingly this doesn't cover character property escapes e.g `\d`, but it's not clear that is intentional. Given backslash is not a valid character property character anyway, it seems reasonable to broaden this behavior to bail on any backslash.
1 parent 3226eec commit 81f1350

File tree

2 files changed

+28
-1
lines changed

2 files changed

+28
-1
lines changed

Sources/_RegexParser/Regex/Parse/LexicalAnalysis.swift

+8
Original file line numberDiff line numberDiff line change
@@ -1176,6 +1176,14 @@ extension Source {
11761176
// character property name anyway, and it's nice not to have diverging
11771177
// logic for these cases.
11781178
return true
1179+
case "\\":
1180+
// An escape sequence, which may include e.g '\Q :] \E'. ICU bails here
1181+
// for all its known escape sequences (e.g '\a', '\e' '\f', ...). It
1182+
// seems character class escapes e.g '\d' are excluded, however it's not
1183+
// clear that is intentional. Let's apply the rule for any escape, as a
1184+
// backslash would never be a valid character property name, and we can
1185+
// diagnose any invalid escapes when parsing as a character class.
1186+
return true
11791187
default:
11801188
// We may want to handle other metacharacters here, e.g '{', '(', ')',
11811189
// as they're not valid character property names. However for now

Tests/RegexTests/ParseTests.swift

+20-1
Original file line numberDiff line numberDiff line change
@@ -503,6 +503,25 @@ extension RegexTests {
503503
parseTest(#"[[:{]]"#, charClass(charClass(":", "{")))
504504
parseTest(#"[[:}:]]"#, charClass(charClass(":", "}", ":")))
505505

506+
parseTest(
507+
#"[:[:space:]:]"#,
508+
charClass(":", posixProp_m(.binary(.whitespace)), ":")
509+
)
510+
parseTest(
511+
#"[:a[:space:]b:]"#,
512+
charClass(":", "a", posixProp_m(.binary(.whitespace)), "b", ":")
513+
)
514+
515+
// ICU parses a custom character class if it sees any of its known escape
516+
// sequences in a POSIX character property (though it appears to exclude
517+
// character class escapes e.g '\d'). We do so for any escape sequence as
518+
// '\' is not a valid character property character.
519+
parseTest(#"[:\Q:]\E]"#, charClass(":", quote_m(":]")))
520+
parseTest(#"[:\a:]"#, charClass(":", atom_m(.escaped(.alarm)), ":"))
521+
parseTest(#"[:\d:]"#, charClass(":", atom_m(.escaped(.decimalDigit)), ":"))
522+
parseTest(#"[:\\:]"#, charClass(":", "\\", ":"))
523+
parseTest(#"[:\:]"#, charClass(":", ":"))
524+
506525
parseTest(
507526
#"\D\S\W"#,
508527
concat(
@@ -2319,7 +2338,7 @@ extension RegexTests {
23192338
diagnosticTest(#"\p{x=y}"#, .unknownProperty(key: "x", value: "y"))
23202339
diagnosticTest(#"\p{aaa(b)}"#, .unknownProperty(key: nil, value: "aaa(b)"))
23212340
diagnosticTest("[[:a():]]", .unknownProperty(key: nil, value: "a()"))
2322-
diagnosticTest(#"\p{aaa\p{b}}"#, .unknownProperty(key: nil, value: #"aaa\p{b"#))
2341+
diagnosticTest(#"\p{aaa\p{b}}"#, .unknownProperty(key: nil, value: "aaa"))
23232342
diagnosticTest(#"[[:{:]]"#, .unknownProperty(key: nil, value: "{"))
23242343

23252344
// MARK: Matching options

0 commit comments

Comments
 (0)