I have the following text
tooooooooooooon
According to this book I'm reading, when the ?
follows after any quantifier, it becomes non greedy.
My regex to*?n
is still returning tooooooooooooon
.
It should return ton
shouldn't it?
Any idea why?
A regular expression can only match a fragment of text that actually exists.
Because the substring 'ton' doesn't exist anywhere in your string, it can't be the result of a match. A match will only return a substring of the original string
EDIT: To be clear, if you were using the string below, with an extra 'n'
toooooooonoooooon
this regular expression (which doesn't specify 'o's)
t.*n
would match the following (as many characters as possible before an 'n')
toooooooonoooooon
but the regular expression
t.*?n
would only match the following (as few characters as possible before an 'n')
toooooooon
A regular expression es always eager to match.
Your expression says this:
A 't', followed by *as few as possible* 'o's, followed by a 'n'.
That means any o's necessary will be matched, because there is an 'n' at the end, which the expression is eager to reach. Matching all the o's is it's only possibility to succeed.
Regexps try to match everything in them. Because there are no less 'o's to match than every o in toooon to match the n, everything is matched. Also, because you are using o*? instead of o+? you are not requiring an o to be present.
Example, in Perl
$a = "toooooo";
$b = "toooooon";
if ($a =~ m/(to*?)/) {
print $1,"\n";
}
if ($b =~ m/(to*?n)/) {
print $1,"\n";
}
~>perl ex.pl
t
toooooon
The Regex always does its best to match. The only thing you are doing in this case would be slowing your parser down, by having it backtrack into the /o*?/
node. Once for every single 'o'
in "tooooon"
. Whereas with normal matching, it would take as many 'o'
s, as it can, the first time through. Since the next element to match against is 'n'
, which won't be matched by 'o'
, there is little point in trying to use minimal matching. Actually, when the normal matching fails, it would take quite a while for it to fail. It has to backtrack through every 'o'
, until there is none left to backtrack through. In this case I would actually use maximal matching /to*+n/
. The 'o'
would take all it could, and never give any of it back. This would make it so that when it fails it fails quickly.
Minimal RE succeeding:
'toooooon' ~~ /to*?n/
t o o o o o o n
{t} match [t]
[t] match [o] 0 times
[t]<n> fail to match [n] -> retry [o]
[t]{o} match [o] 1 times
[t][o]<n> fail to match [n] -> retry [o]
[t][o]{o} match [o] 2 times
[t][o][o]<n> fail to match [n] -> retry [o]
. . . .
[t][o][o][o][o]{o} match [o] 5 times
[t][o][o][o][o][o]<n> fail to match [n] -> retry [o]
[t][o][o][o][o][o]{o} match [o] 6 times
[t][o][o][o][o][o][o]{n} match [n]
Normal RE succeeding:
(NOTE: Similar for Maximal RE)
'toooooon' ~~ /to*n/
t o o o o o o n
{t} match [t]
[t]{o}{o}{o}{o}{o}{o} match [o] 6 times
[t][o][o][o][o][o][o]{n} match [n]
Failure of Minimal RE:
'toooooo' ~~ /to*?n/
t o o o o o o
. . . .
. . . .
[t][o][o][o][o]{o} match [o] 5 times
[t][o][o][o][o][o]<n> fail to match [n] -> retry [o]
[t][o][o][o][o][o]{o} match [o] 6 times
[t][o][o][o][o][o][o]<n> fail to match [n] -> retry [o]
[t][o][o][o][o][o][o]<o> fail to match [o] 7 times -> match failed
Failure of Normal RE:
'toooooo' ~~ /to*n/
t o o o o o o
{t} match [t]
[t]{o}{o}{o}{o}{o}{o} match [o] 6 times
[t][o][o][o][o][o][o]<n> fail to match [n] -> retry [o]
[t][o][o][o][o][o] match [o] 5 times
[t][o][o][o][o][o]<n> fail to match [n] -> retry [o]
. . . .
[t][o] match [o] 1 times
[t][o]<o> fail to match [n] -> retry [o]
[t] match [o] 0 times
[t]<n> fail to match [n] -> match failed
Failure of Maximal RE:
'toooooo' ~~ /to*+n/
t o o o o o o
{t} match [t]
[t]{o}{o}{o}{o}{o}{o} match [o] 6 times
[t][o][o][o][o][o][o]<n> fail to match [n] -> match failed
The string you are searching in (the haystack as it were) does not contain the substring "ton".
It does however contain the substring "tooooooooooooon".