I am having a problem with a non-greedy regular expression. I've seen that there are questions regarding non-greedy regex, but they don't answer my problem.
Problem: I am trying to match the href of the "lol" anchor.
Note: I know this can be done with perl HTML parsing modules, and my question is not about parsing HTML in perl. My question is about the regular expression itself and the HTML is just an example.
Test case: I have 4 tests each for .*? and [^"]. The first 2 produce the expected result. However, the 3rd doesn't, and the 4th does, but I don't understand why.
Questions:
- Why does the 3rd test fail in both sets of tests, for .*? and for [^"]? Shouldn't the non-greedy operator work?
- Why does the 4th test work in both sets of tests, for .*? and for [^"]? I don't understand why including a .* in front changes the regex. (The 3rd and 4th tests are the same except for the .* in front.)
I probably don't understand exactly how these regex work. A perl cookbook recipe mentions something but I don't think it answers my question.
use strict;
my $content=<<EOF;
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)"~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)".*>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nWhy does not the 2nd non-greedy '?' work?\n"
if $content =~ m~href="(.*?)".*?>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nIt now works if I put the '.*' in the front?\n"
if $content =~ m~.*href="(.*?)".*?>lol~s ;
print "\n###################################################\n";
print "Let's try now with [^]";
print "\n###################################################\n\n";
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="([^"]+?)"~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThat's ok.\n" if $content =~ m~href="([^"]+?)".*>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThe 2nd non-greedy still doesn't work?\n"
if $content =~ m~href="([^"]+?)".*?>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nNow with the '.*' in front it does.\n"
if $content =~ m~.*href="([^"]+?)".*?>lol~s ;
The main problem is that you are using non-greedy regexes when you shouldn't. The second problem is using . with *, which can accidentally match more than you intended. The s flag you are using makes . match even more, because it then matches newlines as well.
Use:
for your case. As for non-greedy regexes, consider this code:
It would not match 'xaaac' as you might expect; it will start from the beginning of the string and match 'xaaaaab xaaac'. A greedy variant would match the whole string.
The point is that although non-greedy regexes don't try to grab as much as they can, they still try to match just as eagerly as their greedy brothers, and they will grab whatever part of the string it takes to do so.
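The behaviour described above can be sketched as follows. The answer's original snippet is not shown here, so the sample string and the pattern x.*?c are assumptions chosen to agree with the description; the third pattern, x[^x]*?c, is an additional illustration of how to get the short match.

```perl
use strict;
use warnings;

# Hypothetical sample string (an assumption, not the answer's original data).
my $str = 'xaaaaab xaaac xbbc';

# Non-greedy: the match still starts at the first 'x' and stops at the
# first 'c' that lets the whole pattern succeed.
my ($lazy) = $str =~ /(x.*?c)/;

# Greedy: same starting point, but it runs on to the last 'c'.
my ($greedy) = $str =~ /(x.*c)/;

# Forbidding 'x' inside the match makes the attempt at the first 'x'
# fail, so the engine moves on to the second 'x'.
my ($narrow) = $str =~ /(x[^x]*?c)/;

print "non-greedy: '$lazy'\n";      # 'xaaaaab xaaac'
print "greedy:     '$greedy'\n";    # 'xaaaaab xaaac xbbc'
print "narrowed:   '$narrow'\n";    # 'xaaac'
```

The non-greedy quantifier only shortens the right-hand end of the match; it never moves the starting point to the right.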
You may also consider a "possessive" quantifier, which turns off backtracking. Also, cookbooks are a good place to start, but if you want to understand how things really work, you should read this: perlre
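A small sketch of what "turns off backtracking" means in practice, assuming Perl 5.10 or later (where the possessive forms such as ++ and *+ are available). The string and patterns are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

my $str = 'aaa';

# Greedy: a+ first takes all three 'a's, then gives one back so that
# the final 'a' in the pattern can match.
my $greedy_ok = $str =~ /^a+a$/ ? 1 : 0;

# Possessive: a++ takes all three 'a's and never gives any back, so the
# final 'a' in the pattern has nothing left to match.
my $possessive_ok = $str =~ /^a++a$/ ? 1 : 0;

print "greedy: $greedy_ok, possessive: $possessive_ok\n";   # greedy: 1, possessive: 0
```

A possessive quantifier does not change where a match starts, so on its own it would not make the question's 3rd test pick the "lol" anchor.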
Try printing out $& (the text matched by the entire regex) as well as $1. This may give you a better idea of what's happening.

The problem you seem to have is that .*? does not mean "Find the match out of all possible matches that uses the fewest characters here." It just means "First, try matching 0 characters here, and go on to match the rest of the regex. If that fails, try matching 1 character. If the rest of the regex won't match, try 2 characters here. etc."

Perl will always find the match that starts closest to the beginning of the string. Since most of your patterns start with href=, it will find the first href= in the string and see if there's any way to expand the repetitions to get a match beginning there. If it can't get a match, it'll try starting at the next href=, and so on.

When you add a greedy .* to the beginning of the regex, matching starts with the .* grabbing as many characters as it can. Perl then backtracks to find an href=. Essentially, this causes it to try the last href= in the string first, and work towards the beginning of the string.

Let me try to illustrate what's going on here (see the other answers for why it happens):
href="(.*?)"
Match:
href="/hoh/hoh/hoh/hoh/hoh"
Group:
/hoh/hoh/hoh/hoh/hoh
href="(.*?)".*>lol
Match:
href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol
Group:
/hoh/hoh/hoh/hoh/hoh
href="([^"]+?)".*?>lol
Match:
href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol
Group:
/hoh/hoh/hoh/hoh/hoh
.*href="(.*?)".*?>lol
Match:
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol
Group:
/lol/lol/lol/lol/lol
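The matches and groups listed above can be reproduced by running the question's four .*? patterns and looking at $& alongside $1. This is a sketch under the assumption that the input is exactly the heredoc from the question; the array names are mine.

```perl
use strict;
use warnings;

my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

# The four .*? patterns from the question, in order.
my @patterns = (
    qr~href="(.*?)"~s,
    qr~href="(.*?)".*>lol~s,
    qr~href="(.*?)".*?>lol~s,
    qr~.*href="(.*?)".*?>lol~s,
);

my @groups;
for my $re (@patterns) {
    if ($content =~ $re) {
        # Copy $& and $1 right away; the next match overwrites them.
        my ($whole, $group) = ($&, $1);
        push @groups, $group;
        print "pattern: $re\n";
        print "  match length: ", length($whole), "\n";
        print "  group:        $group\n";
    }
}
```

Only the last pattern puts the "lol" href in $1; the first three all start matching at the first href= in the string.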
One way to write the regex you want is:
href="[^"]*"[^>]*>lol
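A quick check of the suggested pattern against the question's data. The capturing parentheses around [^"]* are my addition so that the href lands in $1; the pattern as written above has no capture group.

```perl
use strict;
use warnings;

my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

my ($href, $whole);

# [^"]* cannot run past the closing quote and [^>]* cannot run past the
# end of the tag, so attempts starting at the earlier href= all fail.
if ($content =~ m~href="([^"]*)"[^>]*>lol~) {
    ($href, $whole) = ($1, $&);
    print "group: $href\n";    # /lol/lol/lol/lol/lol
    print "match: $whole\n";   # href="/lol/lol/lol/lol/lol" class="lol">lol
}
```

Because neither character class can cross a tag boundary, the match no longer depends on greedy versus non-greedy quantifiers or on the s flag.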
Only the 4th test case is working.

The first:
m~href="(.*?)"~s
This will match the first href within your string and capture what is between the quotes, so: /hoh/hoh/hoh/hoh/hoh

The second:
m~href="(.*?)".*>lol~s
This will match the first href within your string and capture what is between the quotes, then match any number of any character until it finds >lol, so: /hoh/hoh/hoh/hoh/hoh
Try capturing the .* with m~href="(.*?)"(.*)>lol~s
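The suggestion above, capturing the .* in a second group, might look like this. It is a sketch that assumes the question's heredoc as input; the variable names are illustrative.

```perl
use strict;
use warnings;

my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

my ($href, $skipped);
if ($content =~ m~href="(.*?)"(.*)>lol~s) {
    ($href, $skipped) = ($1, $2);
    print "group 1: $href\n";       # /hoh/hoh/hoh/hoh/hoh
    print "group 2 is ", length($skipped), " characters long and ",
          ($skipped =~ m~href="/lol/~ ? "contains" : "does not contain"),
          " the lol href\n";
}
```

Group 2 shows that the .* swallowed everything between the first href and >lol, including the href that was actually wanted.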

The third:
m~href="(.*?)".*?>lol~s
Same result as the previous test case.

The fourth:
m~.*href="(.*?)".*?>lol~s
This will match any number of any character, then href=", then capture any number of any character non-greedily until the quote, then match any number of any character until it finds >lol, so: /lol/lol/lol/lol/lol
Try capturing all the .* with m~(.*)href="(.*?)"(.*?)>lol~s
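Capturing every .* in the fourth pattern, as suggested above, shows where each piece of the string ends up. Again a sketch assuming the question's heredoc; the variable names are mine.

```perl
use strict;
use warnings;

my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

my ($before, $href, $between);
if ($content =~ m~(.*)href="(.*?)"(.*?)>lol~s) {
    ($before, $href, $between) = ($1, $2, $3);
    print "group 1 is ", length($before), " characters of leading text\n";
    print "group 2: $href\n";        # /lol/lol/lol/lol/lol
    print "group 3: '$between'\n";   # ' class="lol"'
}
```

The leading greedy group takes the three earlier anchors, which leaves only the "lol" anchor for the rest of the pattern.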

Have a look at this site; it explains what your regexes are doing.