
Commit 08b7808

authored
Merge pull request #302 from hamishknight/update-syntax
2 parents 8068ea1 + fa5f2f1 commit 08b7808

1 file changed: +19 -2 lines changed
Documentation/Evolution/RegexSyntaxRunTimeConstruction.md

@@ -392,7 +392,7 @@ For non-Unicode properties, only a value is required. These include:
 - The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
 - The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.

-Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.
+Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`. Both spellings may be used inside and outside of a custom character class.

 #### `\K`

@@ -534,6 +534,7 @@ These operators have a lower precedence than the implicit union of members, e.g

 To avoid ambiguity between .NET's subtraction syntax and range syntax, .NET specifies that a subtraction will only be parsed if the right-hand-side is a nested custom character class. We propose following this behavior.

+Note that a custom character class may begin with the `:` character, and only becomes a POSIX character property if a closing `:]` is present. For example, `[:a]` is the character class of `:` and `a`.

 ### Matching options

@@ -863,7 +864,23 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat

 ### Extended character property syntax

-ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We propose supporting this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.
+ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`. This has two effects:
+
+- They share the same internal grammar, which allows the use of any Unicode character properties in addition to the POSIX properties.
+- The POSIX syntax may be used outside of custom character classes, unlike in PCRE and Oniguruma.
+
+We propose following both of these rules. The former is purely additive, and therefore should not conflict with regex engines that implement a more limited POSIX syntax. The latter does conflict with other engines, but we feel it is much more likely that a user would expect e.g `[:space:]` to be a character property rather than the character class `[:aceps]`. We do however feel that a warning might be warranted in order to avoid confusion.
+
+### POSIX character property disambiguation
+
+PCRE, Oniguruma and ICU allow `[:` to be part of a custom character class if a closing `:]` is not present. For example, `[:a]` is the character class of `:` and `a`. However they each have different rules for detecting the closing `:]`:
+
+- PCRE will scan ahead until it hits either `:]`, `]`, or `[:`.
+- Oniguruma will scan ahead until it hits either `:]`, `]`, or the length exceeds 20 characters.
+- ICU will scan ahead until it hits a known escape sequence (e.g `\a`, `\e`, `\Q`, ...), or `:]`. Note this excludes character class escapes e.g `\d`. It also excludes `]`, meaning that even `[:a][:]` is parsed as a POSIX character property.
+
+We propose unifying these behaviors by scanning ahead until we hit either `[`, `]`, `:]`, or `\`. Additionally, we will stop on encountering `}` or a second occurrence of `=`. These fall out of the fact that they would be invalid contents of the alternative `\p{...}` syntax.
+

 ### Script properties
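
The scan-ahead rule proposed in the added "POSIX character property disambiguation" section can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the parser's actual implementation: the function name `scan_posix_property` and its return shape are hypothetical, and validation of the property contents is left out.

```python
def scan_posix_property(pattern, start):
    """Hypothetical helper: decide whether pattern[start:] opens a POSIX
    character property such as [:space:].

    Returns (contents, end_index) when a closing ':]' is found before any
    stop character, otherwise None (the '[' then opens an ordinary custom
    character class that happens to begin with ':').
    """
    if not pattern.startswith("[:", start):
        return None

    i = start + 2
    equals_seen = 0
    while i < len(pattern):
        # A closing ':]' ends the property; check it before the lone ']' rule.
        if pattern.startswith(":]", i):
            return pattern[start + 2:i], i + 2
        ch = pattern[i]
        # '[', ']' and '\' stop the scan, as does '}', which would be
        # invalid inside the alternative \p{...} spelling.
        if ch in "[]\\}":
            return None
        # A second '=' is likewise invalid in \p{...} contents.
        if ch == "=":
            equals_seen += 1
            if equals_seen == 2:
                return None
        i += 1
    # Ran off the end of the pattern without finding ':]'.
    return None
```

Under this sketch, `[:space:]` and `[:script=Latin:]` are recognized as POSIX character properties, while `[:a]` is left to be parsed as the character class of `:` and `a`. Because the scan stops at the first `]`, `[:a][:]` is not treated as a property, unlike the ICU behavior described in the diff.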
