Matching text between delimiters: greedy or lazy r

2019-03-10 05:28发布

For the common problem of matching text between delimiters (e.g. < and >), there's two common patterns:

  • using the greedy * or + quantifier in the form START [^END]* END, e.g. <[^>]*>, or
  • using the lazy *? or +? quantifier in the form START .*? END, e.g. <.*?>.

Is there a particular reason to favour one over the other?

3条回答
smile是对你的礼貌
2楼-- · 2019-03-10 05:39

What most people fail to consider when approaching questions like this is what happens when the regex is unable to find a match. That's when the killer performance sinkholes are most likely to appear. For example, take Tim's example, where you're looking for something like <tag>Hello!. Consider what happens with:

<.*?>Hello!

The regex engine finds a < and it quickly finds a closing >, but not >Hello!. So the .*? continues looking for a > that is followed by Hello!. If there isn't one, it will go all the way to the end of the document before it gives up. Then the regex engine resumes scanning until it finds another <, and tries again. We already know how that's going to turn out, but the regex engine, typically, doesn't; it goes through the same rigamarole with every < in the document. Now consider the other regex:

<[^>]*>Hello!

As before, it quickly matches from the < to the >, but fails to match Hello!. It will backtrack to the <, then quit and start scanning for another <. It will still check every < like the first regex did, but it won't search all the way to the end of the document every time it finds one.

But it's even worse than that. If you think about it, .*? is effectively equivalent to a negative lookahead. It's saying "Before consuming the next character, make sure the remainder of the regex can't match at this position." In other words,

/<.*?>Hello!/

...is equivalent to:

/<(?:(?!>Hello!).)*(?:>Hello!|\z(*FAIL))/

So at every position you're performing, not just a normal match attempt, but a much more expensive lookahead. (It's at least twice as costly, because the lookahead has to scan at least one character, then the . goes ahead and consumes a character.)

((*FAIL) is one of Perl's backtracking-control verbs (also supported in PHP). |\z(*FAIL) means "or reach the end of the document and give up".)

Finally, there's another advantage of the negated-character-class approach. While it doesn't (as @Bart pointed out) act like the quantifier is possessive, there's nothing to stop you from making it possessive, if your flavor supports it:

/<[^>]*+>Hello!/

...or wrap it in an atomic group:

/(?><[^>]*>)Hello!/

Not only will these regexes never backtrack unnecessarily, they don't have to save the state information that makes backtracking possible.

查看更多
不美不萌又怎样
3楼-- · 2019-03-10 05:54

The first is more explicit, i. e. it definitely excludes the closing delimiter from being part of the matched text. This is not guaranteed in the second case (if the regular expression is extended to match more than just this tag).

Example: If you try to match <tag1><tag2>Hello! with <.*?>Hello!, the regex will match

<tag1><tag2>Hello!

whereas <[^>]*>Hello! will match

<tag2>Hello!
查看更多
爷、活的狠高调
4楼-- · 2019-03-10 05:55

Some advantages:

[^>]*:

  • More expressive.
  • Captures newlines regardless of /s flag.
  • Considered quicker, because the engine doesn't have to backtracks to find a successful match (with [^>] the engine doesn't make choices - we give it only one way to match the pattern against the string).

.*?

  • No "code duplication" - the end character only appears once.
  • Simpler in cases the end delimiter is more than a character long. (a character class would not work in this case) A common alternative is (?:(?!END).)*. This is even worse if the END delimiter is another pattern.
查看更多
登录 后发表回答