Understanding regex pattern used to find string be

I have the following html file:

<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">

In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:

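import re

source_file = open('source.html').read()
# search() returns a match object (or None if nothing matched)
result = re.compile(r'videos/(.*?)/"').search(source_file)
if result:
    print(result.group(1))  # the text captured by (.*?)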

I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?

Tags: python regex python-2.7 non-greedy

3 Answers

爱情/是我丢掉的垃圾

Reply #2 · 2019-09-08 05:33

The ? in this context is a modifier on the repetition operators (+, *, and ?). In engines where it is available, it makes the repetition lazy (also called non-greedy or reluctant). By default, repetition is greedy, which means that it matches as much as possible. So you have three types of repetition in most modern Perl-compatible engines:

.*  # Match any character zero or more times, as many as possible (greedy)
.*? # Match any character zero or more times, as few as possible before the rest of the pattern can match (reluctant)
.*+ # Match any character zero or more times, as many as possible, and never give any back (possessive)

More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).

Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is

match one or more a's followed by an a.

This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.

If we use the regex /(a+?)a/ this is

reluctantly match one or more a's followed by an a
or
match one or more a's until we reach another a

That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
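The aaaa example can be checked directly in Python. This is a minimal sketch using the standard re module; the variable names are only for illustration.

```python
import re

text = "aaaa"

# Greedy: a+ grabs all four a's, then gives one back so the final a can match.
greedy = re.search(r"(a+)a", text)
# Lazy: a+? takes a single a, and the final a matches right after it.
lazy = re.search(r"(a+?)a", text)

print(greedy.group(0), greedy.group(1))  # aaaa aaa
print(lazy.group(0), lazy.group(1))      # aa a
```

group(0) is the whole match and group(1) is the first submatch, so the output lines up with the two cases described above.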

This comes up a lot when using regex to match within HTML tags, quotes and the like, which is usually reserved for quick and dirty operations. That is to say, using regex to extract from very large and complex HTML strings, or from quoted strings with escape sequences, can cause a lot of problems, but it's perfectly fine for specific use cases. So in your case we have:

/Dev/videos/1610110089242029/

The expression needs to match videos/ followed by zero or more characters followed by /". If there were only one videos URL in the input, a greedy match would work just as well.

However we have

/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"

Without reluctance, the regex will match:

1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029

It tries to match as much as possible, and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (strictly speaking, the engine gets there by backtracking, but you can read about that separately). Thus you only get the part of the URL you need.
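To see the difference on the tag from the question, here is a small sketch that runs both patterns against a shortened copy of that tag. The html string is trimmed by hand from the question's snippet; everything else is the standard re module.

```python
import re

# Shortened copy of the anchor tag from the question.
html = ('<a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" '
        'aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" '
        'rel="theater">')

# Greedy: runs to the end of the string, then backs up to the LAST /".
greedy = re.search(r'videos/(.*)/"', html).group(1)
# Lazy: stops at the FIRST /" after videos/.
lazy = re.search(r'videos/(.*?)/"', html).group(1)

print(greedy)
print(lazy)
```

The greedy capture starts with the ID but continues through aria-label and the second URL; the lazy capture is just 1610110089242029.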


Summer. ? 凉城

Reply #3 · 2019-09-08 05:33

The . means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy: the .* will try to capture as few characters as possible. When the regex reaches a /, the . could match it, but because the repetition is non-greedy the engine first checks whether the rest of the pattern (/") matches at that point, and if it does, the .* stops there. If you didn't have the ?, the .* would first eat up everything to the end of the line (by default . does not match a newline), then give characters back until the last /" on that line, so the capture would run far past the ID you want.
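How far a greedy .* actually reaches can be checked with a short experiment. The two-line text below is a made-up input, not taken from the question; the point is only that in Python's re module . does not match a newline unless re.DOTALL is passed.

```python
import re

# Hypothetical two-line input with three videos/.../" occurrences.
text = 'videos/111/" x="videos/222/"\nvideos/333/" end'

# Default: .* cannot cross the newline, so it backs up to the last /" on line 1.
default = re.search(r'videos/(.*)/"', text).group(1)
# With re.DOTALL, . also matches \n, so .* backs up to the last /" in the whole text.
dotall = re.search(r'videos/(.*)/"', text, re.DOTALL).group(1)

print(repr(default))
print(repr(dotall))
```

In both cases the greedy match terminates; it just ends at the last /" it can reach rather than the first.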


Root（大扎）

Reply #4 · 2019-09-08 05:57

It can be explained in a simple way:

.: match anything (any character),
*: any number of times (at least zero times),
?: as few times as possible (hence non-greedy).

videos/(.*?)/"

as a regular expression matches (for example)

videos/1610110089242029/"

and the first capturing group returns 1610110089242029, because any of the digits is part of “any character” and there are at least zero characters in it.

The ? causes something like this:

videos/1610110089242029/" something else … "videos/2387423470237509/"

to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.
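The two-URL example can be reproduced with re.findall, which returns the captured group for every match. This is a sketch; the sample string follows the one above, with the ellipsis written as three plain dots.

```python
import re

text = 'videos/1610110089242029/" something else ... "videos/2387423470237509/"'

# Non-greedy: each match ends at the nearest /", so both IDs come out separately.
lazy_ids = re.findall(r'videos/(.*?)/"', text)
# Greedy: one match that swallows everything up to the last /".
greedy_ids = re.findall(r'videos/(.*)/"', text)

print(lazy_ids)    # ['1610110089242029', '2387423470237509']
print(greedy_ids)  # ['1610110089242029/" something else ... "videos/2387423470237509']
```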
