In some Rails code (Cucumber step definitions, JavaScript files, the rails_admin
gem) I found regular expressions like this:
string =~ /some regexp.+rules should match "(.*?)"/i
I have some knowledge of regular expressions, and I know that `*` and `?` are similar: the asterisk means "zero or more", while the question mark means "may or may not be present".

So putting a question mark after a group of symbols makes that group optional within the phrase being tested. What, then, is the point of putting it after something that is already optional (as far as I know, the asterisk already makes it optional)?
Right after a quantifier (like `*`), the `?` has a different meaning and makes it "ungreedy". So while the default is that `*` consumes as much as possible, `*?` matches as little as possible.

In your specific case, this is relevant for strings that contain more than one pair of double quotes.
Without the question mark, the regex matches the full string (because `.*` can consume `"` just like anything else) and `some string" or "another` is captured. With the question mark, the match stops as soon as possible (so after `...some string"`) and captures only `some string`.

Further reading.
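The difference can be checked with a minimal Ruby sketch. The input string below is an assumption made up for illustration (the original example string is not shown); only the regex comes from the question.

```ruby
# Hypothetical input with two quoted parts (assumed, not from the original post).
input = 'some regexp with rules should match "some string" or "another"'

# Greedy: .* runs to the last double quote in the string.
greedy = /some regexp.+rules should match "(.*)"/i
# Lazy: .*? stops at the first double quote that lets the match succeed.
lazy = /some regexp.+rules should match "(.*?)"/i

# String#[] with a regexp and a group index returns that capture group.
greedy_capture = input[greedy, 1]
lazy_capture = input[lazy, 1]

puts greedy_capture # some string" or "another
puts lazy_capture   # some string
```

In a Cucumber step definition the lazy form is usually what is wanted, since each `"(.*?)"` should capture exactly one quoted argument.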
It makes the search non-greedy. That means it will settle for the shortest possible match, not the longest.

Consider this string:

"<person>1</person><person>2</person>"

The regex `<person>.*</person>` would match `<person>1</person><person>2</person>`. So, `.*` is greedy.

The regex `<person>.*?</person>` would match `<person>1</person>`, and `<person>2</person>` in the next match. So, `.*?`
is lazy.

`?` has a dual meaning. After a literal, as in `o?`, it means the last `o` can be there zero or one times. After a quantifier, as in `o*?`, it means the last `o` can be there zero or more times, but the minimum number is selected, i.e., it's non-greedy. These might help explain:
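Both meanings of `?` can be tried out in a short Ruby sketch. The `<person>` string is the one from the answer above; the `"fooo"` string and the `fo?` / `fo*` / `fo*?` patterns are assumptions chosen only for illustration.

```ruby
xml = "<person>1</person><person>2</person>"

# Greedy: one match spanning both elements.
greedy_matches = xml.scan(/<person>.*<\/person>/)
# Lazy: one match per element.
lazy_matches = xml.scan(/<person>.*?<\/person>/)

p greedy_matches # ["<person>1</person><person>2</person>"]
p lazy_matches   # ["<person>1</person>", "<person>2</person>"]

# Dual meaning of "?" (hypothetical sample string):
word = "fooo"
optional_o  = word[/fo?/]  # "?" after a literal: zero or one "o"
greedy_star = word[/fo*/]  # "*" alone: as many "o" as possible
lazy_star   = word[/fo*?/] # "?" after "*": as few "o" as possible

p optional_o  # "fo"
p greedy_star # "fooo"
p lazy_star   # "f"
```

Note that `fo*?` on its own matches just `f`, because nothing after it forces the lazy quantifier to take more characters.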
The non-greedy use of `?` is unfortunate, I think. They reused an operator we expected to have a single meaning, "zero or one", and threw it at us in a way that can be really difficult to decipher.

But the need was genuine: too many times we'd write a pattern that would go wildly wrong, gobbling everything in sight, because the regex engine was doing what we said with unforeseen character patterns. Regexes can be very complex and convoluted, but the "non-greedy" use of `?` helps tame that. Sometimes using it is the sloppy or quick-and-dirty way out when we don't have time to rewrite the pattern to do it correctly. Sometimes it's the magic bullet and is elegant. I think which it is depends on whether you're under a deadline and writing code to get something done, or you're debugging years after the fact and finally find that `?` wasn't the optimal fix.