For the common problem of matching text between delimiters (e.g. < and >), there are two common patterns:
- using the greedy * or + quantifier in the form START [^END]* END, e.g. <[^>]*>, or
- using the lazy *? or +? quantifier in the form START .*? END, e.g. <.*?>.
Is there a particular reason to favour one over the other?
The first is more explicit, i.e. it definitely excludes the closing delimiter from the matched text. This is not guaranteed in the second case once the regular expression is extended to match more than just this tag.
Example: If you try to match <tag1><tag2>Hello! with <.*?>Hello!, the regex will match
<tag1><tag2>Hello!
whereas <[^>]*>Hello! will match
<tag2>Hello!
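The difference is easy to reproduce. Here is a minimal sketch using Python's re module (the choice of Python is an assumption for illustration; the patterns and the sample text are the ones from the example above):

```python
import re

text = "<tag1><tag2>Hello!"

# Lazy quantifier: .*? is free to run past the first '>' if that lets
# the rest of the pattern match.
lazy = re.search(r"<.*?>Hello!", text)

# Negated character class: [^>]* can never cross a '>'.
negated = re.search(r"<[^>]*>Hello!", text)

print(lazy.group())     # <tag1><tag2>Hello!
print(negated.group())  # <tag2>Hello!
```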
What most people fail to consider when approaching questions like this is what happens when the regex is unable to find a match. That's when the killer performance sinkholes are most likely to appear. Take Tim's example, where you're looking for something like <tag>Hello!. Consider what happens with:
<.*?>Hello!
The regex engine finds a < and it quickly finds a closing >, but not >Hello!. So the .*? continues looking for a > that is followed by Hello!. If there isn't one, it will go all the way to the end of the document before it gives up. Then the regex engine resumes scanning until it finds another <, and tries again. We already know how that's going to turn out, but the regex engine, typically, doesn't; it goes through the same rigamarole with every < in the document. Now consider the other regex:
<[^>]*>Hello!
As before, it quickly matches from the < to the >, but fails to match Hello!. It will backtrack to the <, then quit and start scanning for another <. It will still check every < like the first regex did, but it won't search all the way to the end of the document every time it finds one.
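To put rough numbers on that, here is a toy model of the two failure modes. It is not how any real regex engine is implemented. The functions lazy_steps and negated_steps are hypothetical helpers that only count how many positions each strategy would visit, assuming the text contains no newlines and ignoring the short backtrack the second pattern performs inside a single tag.

```python
def lazy_steps(text, tail):
    """Positions visited by a simplified model of <.*?>TAIL."""
    steps = 0
    for start, ch in enumerate(text):
        if ch != "<":
            continue
        # .*? retries '>' + tail at every later position, up to the end.
        for pos in range(start + 1, len(text) + 1):
            steps += 1
            if text.startswith(">" + tail, pos):
                return steps
    return steps


def negated_steps(text, tail):
    """Positions visited by a simplified model of <[^>]*>TAIL."""
    steps = 0
    for start, ch in enumerate(text):
        if ch != "<":
            continue
        pos = start + 1
        # [^>]* stops at the first '>' (or at the end of the text).
        while pos < len(text) and text[pos] != ">":
            pos += 1
            steps += 1
        steps += 1  # one attempt at '>' + tail, then give up on this '<'
        if text.startswith(">" + tail, pos):
            return steps
    return steps


# 100 tags, none of them followed by "Hello!", so both patterns fail.
document = "<a>x" * 100
print(lazy_steps(document, "Hello!"), negated_steps(document, "Hello!"))
```

In this model the lazy version's work grows roughly with the square of the document length when there is no match, while the negated-class version stays proportional to the length.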
But it's even worse than that. If you think about it, .*? is effectively equivalent to a negative lookahead. It's saying "Before consuming the next character, make sure the remainder of the regex can't match at this position." In other words,
/<.*?>Hello!/
...is equivalent to:
/<(?:(?!>Hello!).)*(?:>Hello!|\z(*FAIL))/
So at every position you're performing not just a normal match attempt, but a much more expensive lookahead. (It's at least twice as costly, because the lookahead has to scan at least one character, then the . goes ahead and consumes a character.)
((*FAIL) is one of Perl's backtracking-control verbs (also supported in PHP). |\z(*FAIL) means "or reach the end of the document and give up".)
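The equivalence can be checked on a few inputs. The sketch below uses Python, whose re module has no (*FAIL) verb, so the |\z(*FAIL) branch is left out. That is an assumption of this example: the matches should be the same, and only the way a failed attempt gives up differs. The sample strings are made up for illustration.

```python
import re

lazy = re.compile(r"<.*?>Hello!")

# Explicit-lookahead form: consume a character only while the rest of
# the pattern (">Hello!") cannot match at the current position.
tempered = re.compile(r"<(?:(?!>Hello!).)*>Hello!")

samples = [
    "<tag1><tag2>Hello!",
    "<a>Goodbye <b>Hello! and <c>Hello!",
    "<no><match><here>",
]

for s in samples:
    # Both patterns should report the same list of matches.
    print(lazy.findall(s) == tempered.findall(s), lazy.findall(s))
```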
Finally, there's another advantage of the negated-character-class approach. While it doesn't (as @Bart pointed out) act like the quantifier is possessive, there's nothing to stop you from making it possessive, if your flavor supports it:
/<[^>]*+>Hello!/
...or wrap it in an atomic group:
/(?><[^>]*>)Hello!/
Not only will these regexes never backtrack unnecessarily, they don't have to save the state information that makes backtracking possible.
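As a sketch of how this looks outside Perl: Python's re module accepts possessive quantifiers and atomic groups from version 3.11 onward, so the example below guards on the interpreter version. The helper name match_variants is hypothetical, and the sample text is the one used earlier in this thread.

```python
import re
import sys


def match_variants(text):
    """Return the matched text for each variant this Python supports."""
    patterns = {"plain": r"<[^>]*>Hello!"}
    # Possessive quantifiers and atomic groups need Python 3.11 or later.
    if sys.version_info >= (3, 11):
        patterns["possessive"] = r"<[^>]*+>Hello!"
        patterns["atomic"] = r"(?><[^>]*>)Hello!"
    results = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, text)
        results[name] = m.group() if m else None
    return results


results = match_variants("<tag1><tag2>Hello!")
print(results)
```

All supported variants should find the same text here. The possessive and atomic forms differ only in refusing to backtrack into the tag once it has been matched.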