Regular Expression nongreedy is greedy

2019-04-25 11:04发布

I have the following text

tooooooooooooon

According to this book I'm reading, when the ? follows after any quantifier, it becomes non greedy.

My regex to*?n is still returning tooooooooooooon.

It should return ton shouldn't it?

Any idea why?

5条回答
淡お忘
2楼-- · 2019-04-25 11:22

A regular expression es always eager to match.

Your expression says this:

A 't', followed by *as few as possible* 'o's, followed by a 'n'.

That means any o's necessary will be matched, because there is an 'n' at the end, which the expression is eager to reach. Matching all the o's is it's only possibility to succeed.

查看更多
男人必须洒脱
3楼-- · 2019-04-25 11:26

A regular expression can only match a fragment of text that actually exists.

Because the substring 'ton' doesn't exist anywhere in your string, it can't be the result of a match. A match will only return a substring of the original string

EDIT: To be clear, if you were using the string below, with an extra 'n'

toooooooonoooooon

this regular expression (which doesn't specify 'o's)

t.*n

would match the following (as many characters as possible before an 'n')

toooooooonoooooon

but the regular expression

t.*?n

would only match the following (as few characters as possible before an 'n')

toooooooon
查看更多
成全新的幸福
4楼-- · 2019-04-25 11:26

The string you are searching in (the haystack as it were) does not contain the substring "ton".

It does however contain the substring "tooooooooooooon".

查看更多
We Are One
5楼-- · 2019-04-25 11:33

Regexps try to match everything in them. Because there are no less 'o's to match than every o in toooon to match the n, everything is matched. Also, because you are using o*? instead of o+? you are not requiring an o to be present.

Example, in Perl

$a = "toooooo";
$b = "toooooon";

if ($a =~ m/(to*?)/) {
        print $1,"\n";
}
if ($b =~ m/(to*?n)/) {
        print $1,"\n";
}

~>perl ex.pl
t
toooooon
查看更多
兄弟一词,经得起流年.
6楼-- · 2019-04-25 11:33

The Regex always does its best to match. The only thing you are doing in this case would be slowing your parser down, by having it backtrack into the /o*?/ node. Once for every single 'o' in "tooooon". Whereas with normal matching, it would take as many 'o's, as it can, the first time through. Since the next element to match against is 'n', which won't be matched by 'o', there is little point in trying to use minimal matching. Actually, when the normal matching fails, it would take quite a while for it to fail. It has to backtrack through every 'o', until there is none left to backtrack through. In this case I would actually use maximal matching /to*+n/. The 'o' would take all it could, and never give any of it back. This would make it so that when it fails it fails quickly.

Minimal RE succeeding:

'toooooon' ~~ /to*?n/

 t  o  o  o  o  o  o  n       
{t}                           match [t]
[t]                           match [o] 0 times
[t]<n>                        fail to match [n] -> retry [o]
[t]{o}                        match [o] 1 times
[t][o]<n>                     fail to match [n] -> retry [o]
[t][o]{o}                     match [o] 2 times
[t][o][o]<n>                  fail to match [n] -> retry [o]

. . . .

[t][o][o][o][o]{o}            match [o] 5 times
[t][o][o][o][o][o]<n>         fail to match [n] -> retry [o]
[t][o][o][o][o][o]{o}         match [o] 6 times
[t][o][o][o][o][o][o]{n}      match [n]

Normal RE succeeding:

(NOTE: Similar for Maximal RE)

'toooooon' ~~ /to*n/

 t  o  o  o  o  o  o  n       
{t}                           match [t]
[t]{o}{o}{o}{o}{o}{o}         match [o] 6 times
[t][o][o][o][o][o][o]{n}      match [n]

Failure of Minimal RE:

'toooooo' ~~ /to*?n/

 t  o  o  o  o  o  o

. . . .

. . . .

[t][o][o][o][o]{o}            match [o] 5 times
[t][o][o][o][o][o]<n>         fail to match [n] -> retry [o]
[t][o][o][o][o][o]{o}         match [o] 6 times
[t][o][o][o][o][o][o]<n>      fail to match [n] -> retry [o]
[t][o][o][o][o][o][o]<o>      fail to match [o] 7 times -> match failed

Failure of Normal RE:

'toooooo' ~~ /to*n/

 t  o  o  o  o  o  o       
{t}                           match [t]
[t]{o}{o}{o}{o}{o}{o}         match [o] 6 times
[t][o][o][o][o][o][o]<n>      fail to match [n] -> retry [o]
[t][o][o][o][o][o]            match [o] 5 times
[t][o][o][o][o][o]<n>         fail to match [n] -> retry [o]

. . . .

[t][o]                        match [o] 1 times
[t][o]<o>                     fail to match [n] -> retry [o]
[t]                           match [o] 0 times
[t]<n>                        fail to match [n] -> match failed

Failure of Maximal RE:

'toooooo' ~~ /to*+n/

 t  o  o  o  o  o  o
{t}                           match [t]
[t]{o}{o}{o}{o}{o}{o}         match [o] 6 times
[t][o][o][o][o][o][o]<n>      fail to match [n] -> match failed
查看更多
登录 后发表回答