There was this question which made me realise that greediness of quantifiers is not always the same in certain regex engines. Taking the regex from that question and modifying it a bit:
!\[(.*?)*\]
(I know that *
is redundant here, but I found what's following to be quite an interesting behaviour).
And if we try to match against:
![][][]
I expected to get the first capture group to be empty, because (.*?)
is lazy and will stop at the first ]
it comes across. This is indeed what happens in:
- PCRE
- Python
- but not Javascript where it matches the whole
][][
. (jsfiddle)
I looked around with some other languages, for instance ruby, java, C# but all behave like I expected them to (i.e. return empty capture groups).
(regexplanet's golang flavour apparently also gets non-empty capture groups)
It seems that JavaScript's regex engine is interpreting the second *
to convert .*?
from lazy to greedy. Note that converting the second *
to *?
seems to make the regex work as I expected (as does removing the quantifier completely, because I know that it's being redundant in this situation, but that's not the point).
*
was used in the regex, but this behaviour is similar with +
, ?
or {m,n}
and converting them to their lazy version gives the same results as with *?
.
Does anyone know what's really happening?
I did some tests and found that in Firefox and Chrome, if a group has a greedy quantifier and directly or indirectly contains one or more lazy quantifiers with zero as the minimum number of iterations, then for the iterations of the greedy quantifier where the minimum number of iterations has already been satisfied, the leftmost lazy quantifier that can match one or more iterations will match one iteration if the group would find a zero-length match if the lazy quantifier were to match zero iterations.
E.g.
(.{0,2}?){5,8}
matches "abc" in "abcdefghijklmnopqrstuvwxyz" because.{0,2}?
matches one iteration for iterations 6, 7, and 8 of{5,8}
.If there are tokens after the group with the greedy quantifier that can't be matched, then the lazy quantifier expands its number of iterations. The permutation with zero iterations is never attempted. But the greedy quantifier can still backtrack to its minimum number of iterations.
On the same subject string,
(.{0,3}?){5,6}[ad]
matches "abcd" while(.{0,3}?){5,6}a
matches "a".If there is anything else inside the group that finds a match, then the lazy quantifiers behave as they do in other regex engines.
The same happens in Internet Explorer if and only if there are tokens that are not optional after the group with the greedy quantifier. If there are no tokens after the group or they are all optional then IE behaves like most other regex engines do.
The explanation for the behavior in Firefox and Chrome seems to be the combination of two steps in section 15.10.2.5 in the JavaScript standard. Step 2.1 for RepeatMatcher makes the regex engine fail zero-length iterations of quantifiers that have already matched the minimum number of required iterations, instead of merely halting continued iteration. Step 9 evaluates whatever comes after a lazy quantifier before evaluating the lazy quantifier itself. In these examples that includes the continued repetition of the greedy quantifier. When this greedy quantifier fails in step 2.1, the lazy quantifier is forced to repeat at least once.
So to answer whether this is a bug, I would say it is a bug in the JavaScript specification. There is no benefit to this behavior and makes JavaScript regexes needlessly different from all other (popular) backtracking regex engines. I do not expect that future JavaScript specifications will change this though.
Opera behaves differently.
(.{0,2}?){5,8}
matches "abcd" while(.{0,2}?){6,8}
matches "abcde". Opera seems to force the lazy quantifier to match at least one iteration for all but the first iteration of the greedy quantifier, and then stop iterating when the greedy quantifier has found the required minimum number of iterations.I recommend not using groups or alternatives in which everything is optional. Make sure each alternative and each group matches something. If the group needs to be optional, use
?
to make the whole group optional, instead of making everything inside the group optional. This will reduce the number of permutations the regex engine has to try.This is not answering why grediness is behaving differently in Javascript but it shows that this is not a bug and it was tested to behave so. I will take as example google's v8 engine. I found a similar example in their tests.
/test/mjsunit/third_party/regexp-pcre.js:
https://code.google.com/p/v8/source/browse/trunk/test/mjsunit/third_party/regexp-pcre.js#1085
This code works well for Javascript http://regex101.com/r/nD0uG8 but it does not have the same output for PCRE php and python.
UPDATE: About your question:
https://code.google.com/p/v8/source/browse/trunk/src/parser.cc#5157
Short answer
The behavior is defined in ECMA-262 specs in section 15.10.2 Pattern Semantics, especially in 15.10.2.5 where it discusses the semantics of the production Term :: Atom Quantifier.
As a slight generalization: Let E be a pattern that can match empty string. If there exists an input string S where empty string is the first matchable choice in E, patterns which contain a greedy repetition of pattern E are affected. For example:
Firefox and other Webkit-based browsers seems to follow the specs closely, while IE is off the specs in the case when there is no sequel to the affected pattern.
Long answer
The relevant part of the specs is quoted below. Some part of the specs is omitted ([...]) to keep the discussion relevant. I will explain by condensing the information from the specs while keeping it simple.
Vocabulary
There should be no confusion here. State keeps track of the position that has been processed so far (and also the captures, which we are not interested in for the moment). MatchResult, well, is the match result (duh!).
A Matcher contains code to match a subpattern that it represents, and it will call Continuation to continue the match. And Continuation contains code to continue the match from where Matcher left off. They both accepts State as an argument to know where to start matching.
Production Term :: Atom Quantifier
Take note that
m
is the Matcher for the Atom that is being repeated, and Continuation c is passed in from the code generated from higher level production rules.Step 2 defines a Continuation d where we try to match another instance of the repeated Atom, which is later called in step 7 (min > 0 case), step 8.3 (lazy case) and step 9 (greedy case) via the Matcher
m
.Step 5 and 6 creates a copy of the current State, for backtracking purpose, and to detect empty string match in step 2.
The steps from here on describes 3 separate cases:
Step 7 covers the case where the quantifier has non-zero min value. Greediness regardless, we need to match at least min instances of Atom, before we are allowed to call the Continuation c.
Due to the condition in step 7, min is 0 from this point on. Step 8 deals with the case where the quantifier is lazy. Step 9, 10, 11 deals with the case where the quantifier is greedy. Note the reversed order of calling.
Calling m(xr, d) means matching one instance of the Atom, then calling Continuation d to continue the repetition.
Calling Continuation c(x) means getting out of the repetition and matching whatever comes next. Note how Continuation c is passed to the RepeatMatcher inside Continuation d, so that it is available to all subsequent iteration of the repetition.
Explanation of the quirk in JavaScript RegExp
Step 2.1 is the culprit that causes the discrepancy in the result between PCRE and JavaScript.
The condition min = 0 is reached when the quantifier originally has 0 as min (
*
or{0,n}
) or via step 7, which must be called as long as min > 0 (original quantifier is+
or{n,}
or{n,m}
).When min = 0 and the quantifier is greedy, Matcher m will be called (in step 9), which attempts to match an instance of Atom and call Continuation d that is set up in step 2. Continuation d will recursively call RepeatMatcher, which in turn will call Matcher m (step 9) again.
Without step 2.1, if Matcher m has empty string as its first possible choice to advance ahead in the input, the iteration will (theoretically) continue forever without advance ahead. Given the limited features that JavaScript RegExp supports (no forward-declared back-reference or other fancy features), there is no need to go ahead with another iteration when the current iteration matches empty string - an empty string is going to be matched again anyway. Step 2.1 is JavaScript's method of dealing with repetition of empty string.
As set up in step 2.1, when min = 0, the JavaScript regex engine will prune when an empty string is matched by the Matcher m (return failure). This effectively rejects "empty string repeated finitely many times is an empty string".
Another side effect of step 2.1 is that from the point when min = 0, as long as there is one instance where Matcher
m
matches non-empty string, the last repetition of Atom is guaranteed to be non-empty.In contrast, it seems PCRE (and other engines) accepts "empty string repeated finitely many times is an empty string", and continue on with the rest of the pattern. This explains the behavior of PCRE in the cases listed above. As for the algorithm, PCRE stops the repetition after matching empty string twice in a row.