I have the following html file:
<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">
In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:
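import re

# Read the whole file into a single string.
source_file = open('source.html').read()
# search() returns a match object; group(1) is the text captured by (.*?).
result = re.compile(r'videos/(.*?)/"').search(source_file)
print(result.group(1))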
I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?
The ? in this context is a special operator on the repetition operators (+, *, and ?). In engines where it is available, it makes the repetition lazy (also called non-greedy or reluctant). By default, repetition is greedy, which means that it matches as much as possible. So you have three types of repetition in most modern Perl-compatible engines: greedy (.*), lazy or reluctant (.*?), and possessive (.*+).

More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).
Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally, this is "one or more a's, as many as possible, followed by an a". This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.

If we use the regex /(a+?)a/, this is "one or more a's, as few as possible, followed by an a". That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition, and then it is followed by an a.

This comes up a lot when using regex to match within HTML tags, quotes, and the like, usually for quick and dirty operations. That is to say, using regex to extract from very large and complex HTML strings, or from quoted strings with escape sequences, can cause a lot of problems, but it's perfectly fine for specific use cases.
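The aaaa walk-through can be checked directly with Python's re module. This is only a minimal sketch; the variable names are made up for the illustration.

```python
import re

text = "aaaa"

# Greedy: a+ takes as many a's as it can, leaving one for the trailing a.
greedy = re.search(r"(a+)a", text)
print(greedy.group(0), greedy.group(1))  # prints: aaaa aaa

# Lazy: a+? takes as few a's as it can, so a single a is enough.
lazy = re.search(r"(a+?)a", text)
print(lazy.group(0), lazy.group(1))  # prints: aa a
```

group(0) is the whole match and group(1) is the first submatch, which is why the two patterns print different pairs for the same input.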
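Your pattern is videos/(.*?)/". The expression needs to match videos/ followed by zero or more characters followed by /". If there is only one videos URL on the line, that works just fine without being reluctant.

However, in the HTML from the question we have two of them, in the href and ajaxify attributes:
href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/"

Without reluctance, the group will capture, on that snippet:
1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029

It tries to match as much as possible, and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks, but you can read about that separately). Thus you only get the part of the URL you need.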
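To see the difference on the markup from the question, here is a small sketch. The line variable is a shortened copy of the anchor tag from the HTML above, assumed to sit on a single line, and the variable names are made up for the example.

```python
import re

# Shortened copy of the <a> tag from the question (assumed to be on one line).
line = ('<a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" '
        'aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" '
        'rel="theater">')

# Greedy: .* runs to the end of the line, then backs off to the LAST /" it finds.
greedy = re.search(r'videos/(.*)/"', line).group(1)

# Lazy: .*? grows one character at a time and stops at the FIRST /".
lazy = re.search(r'videos/(.*?)/"', line).group(1)

print(greedy)
print(lazy)  # prints: 1610110089242029
```

The greedy capture spans both attributes, while the lazy one returns only the ID.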
. (the dot) means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy; that means that it will try to capture as few characters as possible. That is, when the regex encounters a /, it could match it with the ., but it would rather not because the . is non-greedy, and since the next part of the regex, /", is happy to match there, the . doesn't have to. If you didn't have the ?, that .* would eat up the rest of the line, because it is eager to match as many characters as possible, and would then give back only enough to reach the last /" on that line. (In Python, . does not match a newline unless re.DOTALL is set; with that flag it would run to the end of the file before backing off.)

It can be explained in a simple way:
. : match anything (any character),
* : any number of times (at least zero times),
? : as few times as possible (hence non-greedy).

videos/(.*?)/" as a regular expression matches (for example)
videos/1610110089242029/"
and the first capturing group returns 1610110089242029, because each of the digits is part of “any character” and there are at least zero characters in it.

The ? causes something like this:
videos/1610110089242029/" something else … "videos/2387423470237509/"
to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.
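When a line contains several video links, re.findall returns every lazy match at once. The sketch below uses a made-up string built from the two IDs in the example above. The digits-only pattern at the end is an alternative suggested here, not something from the question, and it assumes the ID is always numeric.

```python
import re

# Made-up input with two video URLs on one line.
text = '"videos/1610110089242029/" something else "videos/2387423470237509/"'

# Lazy: each match stops at the first /" after its own videos/.
lazy_ids = re.findall(r'videos/(.*?)/"', text)

# Greedy: a single match stretching from the first videos/ to the last /".
greedy_ids = re.findall(r'videos/(.*)/"', text)

# Alternative (assumes numeric IDs): \d+ cannot run past the slash at all.
digit_ids = re.findall(r'videos/(\d+)/', text)

print(lazy_ids)    # ['1610110089242029', '2387423470237509']
print(greedy_ids)
print(digit_ids)   # ['1610110089242029', '2387423470237509']
```

The greedy list has a single long entry, which is the "as much as possible" behaviour the ? switches off.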