@@ -129,9 +129,9 @@ If not, we can reject the match immediately without iterating through many
129129possibilities.
130130
131131As an example, consider the regex "(a[bc]+)\1". The compiled
132- representation will have a top-level concatenation subre node. Its left
133- child is a capture node, and the child of that is a plain DFA node for
134- "a[bc]+". The concatenation's right child is a backref node for \1.
132+ representation will have a top-level concatenation subre node. Its first
133+ child is a plain DFA node for "a[bc]+" (which is marked as being a capture
134+ node). The concatenation's second child is a backref node for \1.
135135The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
136136where the backref has been replaced by a copy of the DFA for its referent
137137expression. When executed, the concatenation node will have to search for
@@ -147,6 +147,17 @@ run much faster than a pure NFA engine could do. It is this behavior that
147147justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
148148library.
149149
150+ It's perhaps worth noting that separate capture subre nodes are a rarity:
151+ normally, we just mark a subre as capturing and that's it. However, it's
152+ legal to write a regex like "((x))" in which the same substring has to be
153+ captured by multiple sets of parentheses. Since a subre has room for only
154+ one "capno" field, a single subre can't handle that. We handle such cases
155+ by wrapping the base subre (which captures the innermost parens) in a
156+ no-op capture node, or even more than one for "(((x)))" etc. This is a
157+ little bit inefficient because we end up with multiple identical NFAs,
158+ but since the case is pointless and infrequent, it's not worth working
159+ harder.
160+
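The wrapping scheme described in the added paragraph can be sketched as
follows. This is only an illustration, not the library's code: the Subre
class, its field names, and the helper functions are hypothetical stand-ins,
and the op codes "=" (plain DFA node) and "(" (separate capture node) are
used here purely as labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subre:
    """Hypothetical stand-in for a subre node; it has room for one capno."""
    op: str                          # "=" plain DFA node, "(" capture wrapper (labels only)
    capno: int = 0                   # 0 means "not capturing"
    child: Optional["Subre"] = None


def build_nested_captures(depth: int) -> Subre:
    """Build the tree for `depth` sets of parens around one atom, e.g. "((x))"."""
    # The base node captures the innermost parens, which carry the highest number.
    node = Subre(op="=", capno=depth)
    # Each additional enclosing paren pair becomes a no-op capture wrapper.
    for capno in range(depth - 1, 0, -1):
        node = Subre(op="(", capno=capno, child=node)
    return node


def count_capture_wrappers(node: Subre) -> int:
    """Count the separate capture nodes stacked above the base node."""
    wrappers = 0
    while node.op == "(":
        wrappers += 1
        node = node.child
    return wrappers
```

Under these assumptions, "(x)" needs no wrapper at all, "((x))" needs one,
and "(((x)))" needs two, which matches the "rarity" claim above: only
redundant nesting creates separate capture nodes.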
150161
151162Colors and colormapping
152163-----------------------
@@ -261,6 +272,18 @@ and the NFA has these arcs:
261272 states 4 -> 5 on color 2 ("x" only)
262273which can be seen to be a correct representation of the regex.
263274
275+ There is one more complexity, which is how to handle ".", that is, a
276+ match-anything atom. We used to do that by generating a "rainbow"
277+ of arcs of all live colors between the two NFA states before and after
278+ the dot. That's expensive in itself when there are lots of colors,
279+ and it also typically adds lots of follow-on arc-splitting work for the
280+ color splitting logic. Now we handle this case by generating a single arc
281+ labeled with the special color RAINBOW, meaning all colors. Such arcs
282+ never need to be split, so they help keep NFAs small in this common case.
283+ (Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
284+ not supposed to match newline. In that case we still handle "." by
285+ generating an almost-rainbow of all colors except newline's color.)
286+
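To make the saving concrete, here is a small sketch that contrasts the old
per-color "rainbow" with the single RAINBOW arc. It is not taken from the
library: the tuple representation of arcs, the function names, and the
sentinel value chosen for RAINBOW are all assumptions made for illustration.

```python
RAINBOW = "*"  # assumed sentinel meaning "all colors"; the real value is not shown here


def old_style_dot_arcs(from_state, to_state, live_colors):
    """Former approach: one arc per live color between the two states."""
    return [(from_state, color, to_state) for color in live_colors]


def dot_arcs(from_state, to_state, live_colors, nl_color=None):
    """Arcs for a "." atom as (from, color, to) tuples.

    nl_color is None in ordinary mode. In newline-sensitive mode, pass the
    color assigned to newline, which "." must not match.
    """
    if nl_color is None:
        # One arc covers every color, so later color splits never touch it.
        return [(from_state, RAINBOW, to_state)]
    # Almost-rainbow: every live color except newline's.
    return [(from_state, color, to_state)
            for color in live_colors if color != nl_color]
```

With, say, 200 live colors, old_style_dot_arcs() yields 200 arcs per dot,
while dot_arcs() yields one in ordinary mode and 199 in newline-sensitive
mode.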
264287Given this summary, we can see we need the following operations for
265288colors:
266289
@@ -349,18 +372,20 @@ The possible arc types are:
349372
350373 PLAIN arcs, which specify matching of any character of a given "color"
351374 (see above). These are dumped as "[color_number]->to_state".
375+ In addition there can be "rainbow" PLAIN arcs, which are dumped as
376+ "[*]->to_state".
352377
353378 EMPTY arcs, which specify a no-op transition to another state. These
354379 are dumped as "->to_state".
355380
356381 AHEAD constraints, which represent a "next character must be of this
357382 color" constraint. AHEAD differs from a PLAIN arc in that the input
358383 character is not consumed when crossing the arc. These are dumped as
359- ">color_number>->to_state".
384+ ">color_number>->to_state", or possibly ">*>->to_state".
360385
361386 BEHIND constraints, which represent a "previous character must be of
362387 this color" constraint, which likewise consumes no input. These are
363- dumped as "<color_number<->to_state".
388+ dumped as "<color_number<->to_state", or possibly "<*<->to_state".
364389
365390 '^' arcs, which specify a beginning-of-input constraint. These are
366391 dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
@@ -396,14 +421,20 @@ substring, or an imaginary following EOS character if the substring is at
396421the end of the input.
3974223. If the NFA is (or can be) in the goal state at this point, it matches.
398423
424+ This definition is necessary to support regexes that begin or end with
425+ constraints such as \m and \M, which imply requirements on the adjacent
426+ character if any. The executor implements that by checking if the
427+ adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
428+ right color, and it does that in the same loop that checks characters
429+ within the match.
430+
399431So one can mentally execute an untransformed NFA by taking ^ and $ as
400432ordinary constraints that match at start and end of input; but plain
401433arcs out of the start state should be taken as matches for the character
402434before the target substring, and similarly, plain arcs leading to the
403435post state are matches for the character after the target substring.
404- This definition is necessary to support regexes that begin or end with
405- constraints such as \m and \M, which imply requirements on the adjacent
406- character if any. NFAs for simple unanchored patterns will usually have
407- pre-state outarcs for all possible character colors as well as BOS and
408- BOL, and post-state inarcs for all possible character colors as well as
409- EOS and EOL, so that the executor's behavior will work.
436+ After the optimize() transformation, there are explicit arcs mentioning
437+ BOS/BOL/EOS/EOL adjacent to the pre-state and post-state. So a finished
438+ NFA for a pattern without anchors or adjacent-character constraints will
439+ have pre-state outarcs for RAINBOW (all possible character colors) as well
440+ as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.
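The adjacent-character check can be pictured with the sketch below. It is a
simplified model rather than the executor's actual logic: arc labels are
plain Python values, the function names are hypothetical, and it assumes
that RAINBOW covers ordinary character colors but not the BOS/BOL/EOS/EOL
pseudo-characters, as the wording above suggests.

```python
RAINBOW = "RAINBOW"                      # assumed label for "all character colors"
PSEUDO = {"BOS", "BOL", "EOS", "EOL"}    # pseudo-characters at input boundaries


def arc_accepts(label, adjacent):
    """True if an arc with this label accepts the adjacent (pseudo-)character."""
    if label == RAINBOW:
        # Assumption: RAINBOW matches any real color, never a pseudo-character.
        return adjacent not in PSEUDO
    return label == adjacent


def boundary_ok(arc_labels, adjacent):
    """Check pre-state outarcs (or post-state inarcs) against the adjacent item."""
    return any(arc_accepts(label, adjacent) for label in arc_labels)


# Pre-state outarcs for a pattern with no anchors or adjacent-character constraints.
unanchored_pre = {RAINBOW, "BOS", "BOL"}
# Pre-state outarcs for a pattern that can only start at beginning of string.
anchored_pre = {"BOS"}
```

Here boundary_ok(unanchored_pre, 3) and boundary_ok(unanchored_pre, "BOS")
both hold, so a match may start anywhere, whereas
boundary_ok(anchored_pre, 3) fails because a real character of color 3
precedes the candidate substring.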