Greediness behaving differently in JavaScript?

2019-03-14 10:00发布

There was this question which made me realise that greediness of quantifiers is not always the same in certain regex engines. Taking the regex from that question and modifying it a bit:

!\[(.*?)*\]

(I know that * is redundant here, but I found what's following to be quite an interesting behaviour).

And if we try to match against:

![][][]

I expected to get the first capture group to be empty, because (.*?) is lazy and will stop at the first ] it comes across. This is indeed what happens in:

I looked around with some other languages, for instance ruby, java, C# but all behave like I expected them to (i.e. return empty capture groups).

(regexplanet's golang flavour apparently also gets non-empty capture groups)

It seems that JavaScript's regex engine is interpreting the second * to convert .*? from lazy to greedy. Note that converting the second * to *? seems to make the regex work as I expected (as does removing the quantifier completely, because I know that it's being redundant in this situation, but that's not the point).

* was used in the regex, but this behaviour is similar with +, ? or {m,n} and converting them to their lazy version gives the same results as with *?.

Does anyone know what's really happening?

3条回答
孤傲高冷的网名
2楼-- · 2019-03-14 10:33

I did some tests and found that in Firefox and Chrome, if a group has a greedy quantifier and directly or indirectly contains one or more lazy quantifiers with zero as the minimum number of iterations, then for the iterations of the greedy quantifier where the minimum number of iterations has already been satisfied, the leftmost lazy quantifier that can match one or more iterations will match one iteration if the group would find a zero-length match if the lazy quantifier were to match zero iterations.

E.g. (.{0,2}?){5,8} matches "abc" in "abcdefghijklmnopqrstuvwxyz" because .{0,2}? matches one iteration for iterations 6, 7, and 8 of {5,8}.

If there are tokens after the group with the greedy quantifier that can't be matched, then the lazy quantifier expands its number of iterations. The permutation with zero iterations is never attempted. But the greedy quantifier can still backtrack to its minimum number of iterations.

On the same subject string, (.{0,3}?){5,6}[ad] matches "abcd" while (.{0,3}?){5,6}a matches "a".

If there is anything else inside the group that finds a match, then the lazy quantifiers behave as they do in other regex engines.

The same happens in Internet Explorer if and only if there are tokens that are not optional after the group with the greedy quantifier. If there are no tokens after the group or they are all optional then IE behaves like most other regex engines do.

The explanation for the behavior in Firefox and Chrome seems to be the combination of two steps in section 15.10.2.5 in the JavaScript standard. Step 2.1 for RepeatMatcher makes the regex engine fail zero-length iterations of quantifiers that have already matched the minimum number of required iterations, instead of merely halting continued iteration. Step 9 evaluates whatever comes after a lazy quantifier before evaluating the lazy quantifier itself. In these examples that includes the continued repetition of the greedy quantifier. When this greedy quantifier fails in step 2.1, the lazy quantifier is forced to repeat at least once.

So to answer whether this is a bug, I would say it is a bug in the JavaScript specification. There is no benefit to this behavior and makes JavaScript regexes needlessly different from all other (popular) backtracking regex engines. I do not expect that future JavaScript specifications will change this though.

Opera behaves differently. (.{0,2}?){5,8} matches "abcd" while (.{0,2}?){6,8} matches "abcde". Opera seems to force the lazy quantifier to match at least one iteration for all but the first iteration of the greedy quantifier, and then stop iterating when the greedy quantifier has found the required minimum number of iterations.

I recommend not using groups or alternatives in which everything is optional. Make sure each alternative and each group matches something. If the group needs to be optional, use ? to make the whole group optional, instead of making everything inside the group optional. This will reduce the number of permutations the regex engine has to try.

查看更多
啃猪蹄的小仙女
3楼-- · 2019-03-14 10:39

This is not answering why grediness is behaving differently in Javascript but it shows that this is not a bug and it was tested to behave so. I will take as example google's v8 engine. I found a similar example in their tests.

/test/mjsunit/third_party/regexp-pcre.js:

line 1085: res[1006] = /([a]*?)*/;
line 4822: assertToStringEquals("aaaa,a", res[1006].exec("aaaa "), 3176);

https://code.google.com/p/v8/source/browse/trunk/test/mjsunit/third_party/regexp-pcre.js#1085

This code works well for Javascript http://regex101.com/r/nD0uG8 but it does not have the same output for PCRE php and python.

UPDATE: About your question:

It seems that JavaScript's regex engine is interpreting the second * to convert .*? from lazy to greedy

https://code.google.com/p/v8/source/browse/trunk/src/parser.cc#5157

RegExpQuantifier::QuantifierType quantifier_type = RegExpQuantifier::GREEDY;
if (current() == '?') {
    quantifier_type = RegExpQuantifier::NON_GREEDY;
    Advance();
} else if (FLAG_regexp_possessive_quantifier && current() == '+') {
    // FLAG_regexp_possessive_quantifier is a debug-only flag.
    quantifier_type = RegExpQuantifier::POSSESSIVE;
    Advance();
}
查看更多
孤傲高冷的网名
4楼-- · 2019-03-14 10:55

Short answer

The behavior is defined in ECMA-262 specs in section 15.10.2 Pattern Semantics, especially in 15.10.2.5 where it discusses the semantics of the production Term :: Atom Quantifier.

As a slight generalization: Let E be a pattern that can match empty string. If there exists an input string S where empty string is the first matchable choice in E, patterns which contain a greedy repetition of pattern E are affected. For example:

(a*?)*              against     aaaa
!\[(.*?)*\]         against     ![][][]
(.*?){1,4}          against     asdf
(|a)*               against     aaaa
(a|()|b){0,4}       against     aaabbb

Firefox and other Webkit-based browsers seems to follow the specs closely, while IE is off the specs in the case when there is no sequel to the affected pattern.

Long answer

The relevant part of the specs is quoted below. Some part of the specs is omitted ([...]) to keep the discussion relevant. I will explain by condensing the information from the specs while keeping it simple.

Vocabulary

  • A State is an ordered pair (endIndex, captures) where endIndex is an integer and captures is an internal array of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. [...]. Due to backtracking, many States may be in use at any time during the matching process.
  • A MatchResult is either a State or the special token failure that indicates that the match failed.

There should be no confusion here. State keeps track of the position that has been processed so far (and also the captures, which we are not interested in for the moment). MatchResult, well, is the match result (duh!).

  • A Continuation procedure is an internal closure (i.e. an internal procedure with some arguments already bound to values) that takes one State argument and returns a MatchResult result. If an internal closure references variables bound in the function that creates the closure, the closure uses the values that these variables had at the time the closure was created. The Continuation attempts to match the remaining portion (specified by the closure's already-bound arguments) of the pattern against the input String, starting at the intermediate state given by its State argument. If the match succeeds, the Continuation returns the final State that it reached; if the match fails, the Continuation returns failure.
  • A Matcher procedure is an internal closure that takes two arguments -- a State and a Continuation -- and returns a MatchResult result. A Matcher attempts to match a middle subpattern (specified by the closure's already-bound arguments) of the pattern against the input String, starting at the intermediate state given by its State argument. The Continuation argument should be a closure that matches the rest of the pattern. After matching the subpattern of a pattern to obtain a new State, the Matcher then calls Continuation on that new State to test if the rest of the pattern can match as well. If it can, the Matcher returns the State returned by Continuation; if not, the Matcher may try different choices at its choice points, repeatedly calling Continuation until it either succeeds or all possibilities have been exhausted.

A Matcher contains code to match a subpattern that it represents, and it will call Continuation to continue the match. And Continuation contains code to continue the match from where Matcher left off. They both accepts State as an argument to know where to start matching.

Production Term :: Atom Quantifier

The production Term :: Atom Quantifier evaluates as follows:

  1. Evaluate Atom to obtain a Matcher m.
  2. Evaluate Quantifier to obtain the three results: an integer min, an integer (or ∞) max, and Boolean greedy.
  3. If max is finite and less than min, then throw a SyntaxError exception.
  4. Let parenIndex be the number of left capturing parentheses in the entire regular expression that occur to the left of this production expansion's Term. [...]
  5. Let parenCount be the number of left capturing parentheses in the expansion of this production's Atom. [...]
  6. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following:
    1. Call RepeatMatcher(m, min, max, greedy, x, c, parenIndex, parenCount) and return its result.

Take note that m is the Matcher for the Atom that is being repeated, and Continuation c is passed in from the code generated from higher level production rules.

The abstract operation RepeatMatcher takes eight parameters, a Matcher m, an integer min, an integer (or ∞) max, a Boolean greedy, a State x, a Continuation c, an integer parenIndex, and an integer parenCount, and performs the following:

  1. If max is zero, then call c(x) and return its result.
  2. Create an internal Continuation closure d that takes one State argument y and performs the following:
    1. If min is zero and y's endIndex is equal to x's endIndex, then return failure.
    2. If min is zero then let min2 be zero; otherwise let min2 be min - 1.
    3. If max is ∞, then let max2 be ∞; otherwise let max2 be max - 1.
    4. Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result.
  3. Let cap be a fresh copy of x's captures internal array.
  4. For every integer k that satisfies parenIndex < k and k ? parenIndex + parenCount, set cap[k] to undefined.
  5. Let e be x's endIndex.
  6. Let xr be the State (e, cap).
  7. If min is not zero, then call m(xr, d) and return its result.
  8. If greedy is false, then
    1. Call c(x) and let z be its result.
    2. If z is not failure, return z.
    3. Call m(xr, d) and return its result.
  9. Call m(xr, d) and let z be its result.
  10. If z is not failure, return z.
  11. Call c(x) and return its result.

Step 2 defines a Continuation d where we try to match another instance of the repeated Atom, which is later called in step 7 (min > 0 case), step 8.3 (lazy case) and step 9 (greedy case) via the Matcher m.

Step 5 and 6 creates a copy of the current State, for backtracking purpose, and to detect empty string match in step 2.

The steps from here on describes 3 separate cases:

  • Step 7 covers the case where the quantifier has non-zero min value. Greediness regardless, we need to match at least min instances of Atom, before we are allowed to call the Continuation c.

  • Due to the condition in step 7, min is 0 from this point on. Step 8 deals with the case where the quantifier is lazy. Step 9, 10, 11 deals with the case where the quantifier is greedy. Note the reversed order of calling.

Calling m(xr, d) means matching one instance of the Atom, then calling Continuation d to continue the repetition.

Calling Continuation c(x) means getting out of the repetition and matching whatever comes next. Note how Continuation c is passed to the RepeatMatcher inside Continuation d, so that it is available to all subsequent iteration of the repetition.

Explanation of the quirk in JavaScript RegExp

Step 2.1 is the culprit that causes the discrepancy in the result between PCRE and JavaScript.

  1. If min is zero and y's endIndex is equal to x's endIndex, then return failure.

The condition min = 0 is reached when the quantifier originally has 0 as min (* or {0,n}) or via step 7, which must be called as long as min > 0 (original quantifier is + or {n,} or {n,m}).

When min = 0 and the quantifier is greedy, Matcher m will be called (in step 9), which attempts to match an instance of Atom and call Continuation d that is set up in step 2. Continuation d will recursively call RepeatMatcher, which in turn will call Matcher m (step 9) again.

Without step 2.1, if Matcher m has empty string as its first possible choice to advance ahead in the input, the iteration will (theoretically) continue forever without advance ahead. Given the limited features that JavaScript RegExp supports (no forward-declared back-reference or other fancy features), there is no need to go ahead with another iteration when the current iteration matches empty string - an empty string is going to be matched again anyway. Step 2.1 is JavaScript's method of dealing with repetition of empty string.

As set up in step 2.1, when min = 0, the JavaScript regex engine will prune when an empty string is matched by the Matcher m (return failure). This effectively rejects "empty string repeated finitely many times is an empty string".

Another side effect of step 2.1 is that from the point when min = 0, as long as there is one instance where Matcher m matches non-empty string, the last repetition of Atom is guaranteed to be non-empty.

In contrast, it seems PCRE (and other engines) accepts "empty string repeated finitely many times is an empty string", and continue on with the rest of the pattern. This explains the behavior of PCRE in the cases listed above. As for the algorithm, PCRE stops the repetition after matching empty string twice in a row.

查看更多
登录 后发表回答