Perl non-greedy problem

Posted 2019-06-27 03:07

Question:

I am having a problem with a non-greedy regular expression. I've seen that there are other questions about non-greedy regexes, but they don't answer my problem.

Problem: I am trying to match the href of the "lol" anchor.

Note: I know this can be done with perl HTML parsing modules, and my question is not about parsing HTML in perl. My question is about the regular expression itself and the HTML is just an example.

Test case: I have 4 tests each for .*? and [^"]. The first 2 produce the expected result. The 3rd doesn't, and the 4th does, but I don't understand why.

Questions:

  1. Why does the 3rd test fail for both .*? and [^"]? Shouldn't the non-greedy operator work?
  2. Why does the 4th test work for both .*? and [^"]? I don't understand why putting a .* in front changes the regex. (The 3rd and 4th tests are the same except for the .* in front.)

I probably don't understand exactly how these regex work. A perl cookbook recipe mentions something but I don't think it answers my question.

use strict;

my $content=<<EOF;
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)"~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)".*>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nWhy does not the 2nd non-greedy '?' work?\n"
  if $content =~ m~href="(.*?)".*?>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nIt now works if I put the '.*' in the front?\n"
  if $content =~ m~.*href="(.*?)".*?>lol~s ;

print "\n###################################################\n";
print "Let's try now with [^]";
print "\n###################################################\n\n";


print "| $1 | \n\nThat's ok\n" if $content =~ m~href="([^"]+?)"~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThat's ok.\n" if $content =~ m~href="([^"]+?)".*>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThe 2nd non-greedy still doesn't work?\n"
  if $content =~ m~href="([^"]+?)".*?>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nNow with the '.*' in front it does.\n"
  if $content =~ m~.*href="([^"]+?)".*?>lol~s ;

Answer 1:

Try printing out $& (the text matched by the entire regex) as well as $1. This may give you a better idea of what's happening.
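As a minimal sketch of that suggestion, the snippet below reruns the third test from the question and prints the whole match next to the captured group. The names $whole and $group are mine, and $content is a copy of the question's data.

```perl
use strict;
use warnings;

# Same sample data as in the question.
my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

my ($whole, $group) = ('', '');
if ($content =~ m~href="(.*?)".*?>lol~s) {
    # Copy both values before any later match overwrites them.
    ($whole, $group) = ($&, $1);
}

print "whole match:\n$whole\n\n";
print "group 1: $group\n";
```

The whole match starts at the first href and runs across three lines up to >lol, which makes it obvious why $1 holds the first link rather than the "lol" one.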

The problem you seem to have is that .*? does not mean "Find the match out of all possible matches that uses the fewest characters here." It just means "First, try matching 0 characters here, and go on to match the rest of the regex. If that fails, try matching 1 character. If the rest of the regex won't match, try 2 characters here. etc."
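To make that concrete, here is a small sketch on a hypothetical two-link string (not taken from the question; the variable names are also mine). The lazy group starts empty and keeps growing until the rest of the pattern fits, so it ends up spanning both links, while a negated character class cannot grow past a quote.

```perl
use strict;
use warnings;

# Hypothetical input for illustration only.
my $s = 'href="A" x href="B">lol';

# Lazy group: tries 0 characters, then 1, then 2, ... until
# the remainder of the pattern (">lol) matches right after it.
my ($lazy) = $s =~ m~href="(.*?)">lol~;

# Negated class: the group can never contain a double quote,
# so the attempt at the first href fails and the second one is used.
my ($strict) = $s =~ m~href="([^"]*)">lol~;

print "lazy:   $lazy\n";
print "strict: $strict\n";
```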

Perl will always find the match that starts closest to the beginning of the string. Since most of your patterns start with href=, it will find the first href= in the string and see if there's any way to expand the repetitions to get a match beginning there. If it can't get a match, it'll try starting at the next href=, and so on.

When you add a greedy .* to the beginning of the regex, matching starts with the .* grabbing as many characters as it can. Perl then backtracks to find a href=. Essentially, this causes it to try the last href= in the string first, and work towards the beginning of the string.
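The effect of the leading .* can be checked with a short sketch on the question's data. The variable names are mine; the patterns are the question's, shortened in the first two cases so that only the choice of href is visible.

```perl
use strict;
use warnings;

# Same sample data as in the question.
my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

# No leading .* : the first href= in the string is tried first.
my ($first) = $content =~ m~href="(.*?)"~s;

# Leading greedy .* : it swallows everything, then backs off until
# href=" can match, so the last href= is tried first.
my ($last) = $content =~ m~.*href="(.*?)"~s;

# The question's fourth test: the "koo" link has no >lol after it,
# so .* backs off one more link and the "lol" href is captured.
my ($lol) = $content =~ m~.*href="(.*?)".*?>lol~s;

print "first: $first\n";
print "last:  $last\n";
print "lol:   $lol\n";
```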



Answer 2:

Only the 4th test case is working.

The first, m~href="(.*?)"~s, matches the first href in your string and captures what is between the quotes, so: /hoh/hoh/hoh/hoh/hoh

The second, m~href="(.*?)".*>lol~s, matches the first href in your string and captures what is between the quotes, then matches any number of any character until it reaches >lol, so: /hoh/hoh/hoh/hoh/hoh

Try capturing the .* with m~href="(.*?)"(.*)>lol~s

$1 contains :
/hoh/hoh/hoh/hoh/hoh
$2 contains : 
class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol" 

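The third, m~href="(.*?)".*?>lol~s, gives the same result as the previous test case: the match still starts at the first href, and the lazy .*? simply grows until it reaches >lol.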

The fourth, m~.*href="(.*?)".*?>lol~s, matches any number of any character, then href=", then captures any number of any character (non-greedy) up to the quote, then matches any number of any character until it finds >lol, so: /lol/lol/lol/lol/lol

Try capturing all the .* with m~(.*)href="(.*?)"(.*?)>lol~s

$1 contains :
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a
$2 contains : 
/lol/lol/lol/lol/lol
$3 contains :
class="lol"

Have a look at this site; it explains what your regexes are doing.



Answer 3:

The main problem is that you are using non-greedy regexes when you shouldn't. The second problem is using . with *, which can accidentally match more than you intended. The s flag you are using makes . match even more, because it then matches newlines as well.

Use:

m~href="([^"]+)"[^>]*>lol~

for your case. As for non-greedy regexes, consider this code:

$_ = "xaaaaab xaaac xbbc";
m~^x.+?c~;

It will not match 'xaaac' as you might expect; it starts from the beginning of the string and matches 'xaaaaab xaaac'. A greedy variant would match the whole string.
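A runnable version of that comparison, as a sketch: the capturing parentheses and variable names are my additions so the matched text can be printed, and the unanchored variant is included to show that dropping ^ changes nothing here, since the leftmost starting point still wins.

```perl
use strict;
use warnings;

my $s = "xaaaaab xaaac xbbc";

# Lazy: grows from the first x until the first c it can reach.
my ($lazy) = $s =~ m~^(x.+?c)~;

# Greedy: grabs as much as possible, then backs off to the last c.
my ($greedy) = $s =~ m~^(x.+c)~;

# Lazy without the ^ anchor: the match still starts at the first x.
my ($unanchored) = $s =~ m~(x.+?c)~;

print "lazy:       $lazy\n";
print "greedy:     $greedy\n";
print "unanchored: $unanchored\n";
```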

The point is that although non-greedy regexes don't try to grab as much as they can, they are just as determined to find a match as their greedy counterparts, and they will consume as much of the string as it takes to get one.

You may also consider a "possessive" quantifier, which prevents backtracking into the quantified part. Also, cookbooks are a good place to start, but if you want to understand how things really work you should read this - perlre
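For a feel of what a possessive quantifier does, here is a minimal sketch on a hypothetical string (possessive quantifiers such as ++ need Perl 5.10 or later; the variable names are mine). The ordinary a+ gives back one character so the rest of the pattern can match, while a++ refuses to give anything back and the match fails.

```perl
use strict;
use warnings;

# Hypothetical input for illustration only.
my $s = 'aaab';

# a+ first takes all three a's, then gives one back so "ab" can match.
my $normal = ($s =~ /^a+ab/) ? 1 : 0;

# a++ takes all three a's and never gives any back, so "ab" cannot match.
my $possessive = ($s =~ /^a++ab/) ? 1 : 0;

print "normal:     ", ($normal     ? "match" : "no match"), "\n";
print "possessive: ", ($possessive ? "match" : "no match"), "\n";
```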



Answer 4:

Let me try to illustrate what's going on here (see the other answers for why it happens):

href="(.*?)"

Match: href="/hoh/hoh/hoh/hoh/hoh" Group: /hoh/hoh/hoh/hoh/hoh

href="(.*?)".*>lol

Match: href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /hoh/hoh/hoh/hoh/hoh

href="([^"]+?)".*?>lol

Match: href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /hoh/hoh/hoh/hoh/hoh

.*href="(.*?)".*?>lol

Match: <a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /lol/lol/lol/lol/lol

One way to write the regex you want is: href="[^"]*"[^>]*>lol
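A quick check of that pattern against the question's data, as a sketch: the capturing parentheses around [^"]* are my addition so the href can be read back, and $href is a hypothetical name.

```perl
use strict;
use warnings;

# Same sample data as in the question.
my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

# No lazy quantifiers and no /s flag: [^"]* stops at the closing quote,
# and [^>]* cannot leave the current tag, so only the "lol" anchor fits.
my ($href) = $content =~ m~href="([^"]*)"[^>]*>lol~;

print defined $href ? "$href\n" : "no match\n";
```

Because neither character class can cross a tag boundary, the attempts at the earlier hrefs fail outright instead of stretching forward to >lol.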