You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: Documentation/Evolution/RegexSyntaxRunTimeConstruction.md
+19-2
Original file line number
Diff line number
Diff line change
@@ -392,7 +392,7 @@ For non-Unicode properties, only a value is required. These include:
392
392
- The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
393
393
- The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.
394
394
395
-
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.
395
+
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`. Both spellings may be used inside and outside of a custom character class.
396
396
397
397
#### `\K`
398
398
@@ -534,6 +534,7 @@ These operators have a lower precedence than the implicit union of members, e.g
534
534
535
535
To avoid ambiguity between .NET's subtraction syntax and range syntax, .NET specifies that a subtraction will only be parsed if the right-hand-side is a nested custom character class. We propose following this behavior.
536
536
537
+
Note that a custom character class may begin with the `:` character, and only becomes a POSIX character property if a closing `:]` is present. For example, `[:a]` is the character class of `:` and `a`.
537
538
538
539
### Matching options
539
540
@@ -863,7 +864,23 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat
863
864
864
865
### Extended character property syntax
865
866
866
-
ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We propose supporting this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.
867
+
ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`. This has two effects:
868
+
869
+
- They share the same internal grammar, which allows the use of any Unicode character properties in addition to the POSIX properties.
870
+
- The POSIX syntax may be used outside of custom character classes, unlike in PCRE and Oniguruma.
871
+
872
+
We propose following both of these rules. The former is purely additive, and therefore should not conflict with regex engines that implement a more limited POSIX syntax. The latter does conflict with other engines, but we feel it is much more likely that a user would expect e.g `[:space:]` to be a character property rather than the character class `[:aceps]`. We do however feel that a warning might be warranted in order to avoid confusion.
873
+
874
+
### POSIX character property disambiguation
875
+
876
+
PCRE, Oniguruma and ICU allow `[:` to be part of a custom character class if a closing `:]` is not present. For example, `[:a]` is the character class of `:` and `a`. However they each have different rules for detecting the closing `:]`:
877
+
878
+
- PCRE will scan ahead until it hits either `:]`, `]`, or `[:`.
879
+
- Oniguruma will scan ahead until it hits either `:]`, `]`, or the length exceeds 20 characters.
880
+
- ICU will scan ahead until it hits a known escape sequence (e.g `\a`, `\e`, `\Q`, ...), or `:]`. Note this excludes character class escapes e.g `\d`. It also excludes `]`, meaning that even `[:a][:]` is parsed as a POSIX character property.
881
+
882
+
We propose unifying these behaviors by scanning ahead until we hit either `[`, `]`, `:]`, or `\`. Additionally, we will stop on encountering `}` or a second occurrence of `=`. These fall out the fact that they would be invalid contents of the alternative `\p{...}` syntax.
0 commit comments