I have a regular expression:
(<select([^>]*>))(.*?)(</select\s*>)
Since it uses lazy repeat quantifier, for longer strings(having options more than 500) it backtracks for more than 100,000 times and fails.
Please help me to find a better regular expression which doesn't use lazy repeat quantifier
<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>
...or in human-readable form:
<select[^>]*> # start tag
[^<]* # anything except opening bracket
(?: # if you find an open bracket
<(?!/select>) # match it if it's not part of end tag
[^<]* # consume any more non-brackets
)* # repeat as needed
</select> # end tag
This is an example of the "unrolled loop" technique Friedl develops in his book, Mastering Regular Expressions. I did a quick test in RegexBuddy using a pattern based on reluctant quantifiers:
(?s)<select[^>]*>.*?</select>
...and it took about 6,000 steps to find a match. The unrolled-loop pattern took only 500 steps. And when I removed the closing bracket from the end tag (</select
), making a match impossible, it required only 800 steps to report failure.
If your regex flavor supports possessive quantifiers, go ahead and use them, too:
<select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>
It takes about the same number of steps to achieve a match, but it can use a lot less memory in the process. And if no match is possible, it fails even more quickly; in my tests it took about 500 steps, the same number it took to find a match.
Unfortunately this wont work, see the answer by Alan Moore for a correct example!
(<select([^>]*>))(.*+)(</select\s*>)
From the perl regexp manpage:
By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour
is sometimes undesirable. Thus Perl provides the "possessive"
quantifier form as well.
*+ Match 0 or more times and give nothing back
++ Match 1 or more times and give nothing back
?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
For instance,
'aaaa' =~ /a++a/
will never match, as the "a++" will gobble up all the "a"'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/