Writing better regex expression for not using lazy

2019-06-21 15:45发布

I have a regular expression:

(<select([^>]*>))(.*?)(</select\s*>)

Since it uses lazy repeat quantifier, for longer strings(having options more than 500) it backtracks for more than 100,000 times and fails. Please help me to find a better regular expression which doesn't use lazy repeat quantifier

2条回答
Fickle 薄情
2楼-- · 2019-06-21 16:30
<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>

...or in human-readable form:

<select[^>]*>    # start tag
[^<]*            # anything except opening bracket
(?:              # if you find an open bracket
  <(?!/select>)  #   match it if it's not part of end tag
  [^<]*          #   consume any more non-brackets
)*               # repeat as needed
</select>        # end tag

This is an example of the "unrolled loop" technique Friedl develops in his book, Mastering Regular Expressions. I did a quick test in RegexBuddy using a pattern based on reluctant quantifiers:

(?s)<select[^>]*>.*?</select>

...and it took about 6,000 steps to find a match. The unrolled-loop pattern took only 500 steps. And when I removed the closing bracket from the end tag (</select), making a match impossible, it required only 800 steps to report failure.

If your regex flavor supports possessive quantifiers, go ahead and use them, too:

<select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>

It takes about the same number of steps to achieve a match, but it can use a lot less memory in the process. And if no match is possible, it fails even more quickly; in my tests it took about 500 steps, the same number it took to find a match.

查看更多
祖国的老花朵
3楼-- · 2019-06-21 16:41

Unfortunately this wont work, see the answer by Alan Moore for a correct example!

(<select([^>]*>))(.*+)(</select\s*>)

From the perl regexp manpage:

By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.

       *+     Match 0 or more times and give nothing back
       ++     Match 1 or more times and give nothing back
       ?+     Match 0 or 1 time and give nothing back
       {n}+   Match exactly n times and give nothing back (redundant)
       {n,}+  Match at least n times and give nothing back
       {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

      'aaaa' =~ /a++a/

will never match, as the "a++" will gobble up all the "a"'s in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:

      /"(?:[^"\\]++|\\.)*+"/
查看更多
登录 后发表回答