Matching text between delimiters: greedy or lazy r

For the common problem of matching text between delimiters (e.g. < and >), there's two common patterns:

using the greedy * or + quantifier in the form START [^END]* END, e.g. <[^>]*>, or
using the lazy *? or +? quantifier in the form START .*? END, e.g. <.*?>.

Is there a particular reason to favour one over the other?

标签： regex language-agnostic greedy regex-greedy

3条回答

2楼-- · 2019-03-10 05:39

What most people fail to consider when approaching questions like this is what happens when the regex is unable to find a match. That's when the killer performance sinkholes are most likely to appear. For example, take Tim's example, where you're looking for something like <tag>Hello!. Consider what happens with:

<.*?>Hello!

The regex engine finds a < and it quickly finds a closing >, but not >Hello!. So the .*? continues looking for a > that is followed by Hello!. If there isn't one, it will go all the way to the end of the document before it gives up. Then the regex engine resumes scanning until it finds another <, and tries again. We already know how that's going to turn out, but the regex engine, typically, doesn't; it goes through the same rigamarole with every < in the document. Now consider the other regex:

<[^>]*>Hello!

As before, it quickly matches from the < to the >, but fails to match Hello!. It will backtrack to the <, then quit and start scanning for another <. It will still check every < like the first regex did, but it won't search all the way to the end of the document every time it finds one.

But it's even worse than that. If you think about it, .*? is effectively equivalent to a negative lookahead. It's saying "Before consuming the next character, make sure the remainder of the regex can't match at this position." In other words,

/<.*?>Hello!/

...is equivalent to:

/<(?:(?!>Hello!).)*(?:>Hello!|\z(*FAIL))/

So at every position you're performing, not just a normal match attempt, but a much more expensive lookahead. (It's at least twice as costly, because the lookahead has to scan at least one character, then the . goes ahead and consumes a character.)

((*FAIL) is one of Perl's backtracking-control verbs (also supported in PHP). |\z(*FAIL) means "or reach the end of the document and give up".)

Finally, there's another advantage of the negated-character-class approach. While it doesn't (as @Bart pointed out) act like the quantifier is possessive, there's nothing to stop you from making it possessive, if your flavor supports it:

/<[^>]*+>Hello!/

...or wrap it in an atomic group:

/(?><[^>]*>)Hello!/

Not only will these regexes never backtrack unnecessarily, they don't have to save the state information that makes backtracking possible.

0人赞添加讨论(0) 举报

不美不萌又怎样

3楼-- · 2019-03-10 05:54

The first is more explicit, i. e. it definitely excludes the closing delimiter from being part of the matched text. This is not guaranteed in the second case (if the regular expression is extended to match more than just this tag).

Example: If you try to match <tag1><tag2>Hello! with <.*?>Hello!, the regex will match

<tag1><tag2>Hello!

whereas <[^>]*>Hello! will match

<tag2>Hello!

0人赞添加讨论(0) 举报

爷、活的狠高调

4楼-- · 2019-03-10 05:55

Some advantages:

[^>]*:

More expressive.
Captures newlines regardless of /s flag.
Considered quicker, because the engine doesn't have to backtracks to find a successful match (with [^>] the engine doesn't make choices - we give it only one way to match the pattern against the string).

.*?

No "code duplication" - the end character only appears once.
Simpler in cases the end delimiter is more than a character long. (a character class would not work in this case) A common alternative is (?:(?!END).)*. This is even worse if the END delimiter is another pattern.

0人赞添加讨论(0) 举报

Matching text between delimiters: greedy or lazy r

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间