I'm trying to understand non-greedy regexes in Python, but I don't understand why the following examples give these results:
print(re.search('a??b','aaab').group())
ab
print(re.search('a*?b','aaab').group())
aaab
I thought it would be 'b' for the first and 'ab' for the second.
Can anyone explain that?
This happens because the matches you expect start later in the string than the match that is actually found. If you follow how the matching for a??b proceeds from left to right, you'll see something like this:
- Try 0 a plus b vs aaab: no match (b != a)
- Try 1 a plus b vs aaab: no match (ab != aa)
- Try 0 a plus b vs aab: no match (b != a) (match position moved to the right by one)
- Try 1 a plus b vs aab: no match (ab != aa)
- Try 0 a plus b vs ab: no match (b != a) (match position moved to the right by one)
- Try 1 a plus b vs ab: match (ab == ab)
Similarly for *?.
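The trace above can be sketched as a small hand-rolled search. This is only an illustration of the order in which candidates are tried, not how the re module is implemented; the helper name leftmost_lazy and its max_a parameter are made up for this example.

```python
import re

def leftmost_lazy(text, max_a):
    """Emulate a??b (max_a=1) or a*?b (max_a=None) by brute force.

    Hypothetical helper for illustration; it only mirrors the trial order.
    """
    for start in range(len(text) + 1):           # leftmost start position first
        limit = len(text) - start if max_a is None else max_a
        for count in range(limit + 1):           # lazy: fewest 'a' characters first
            candidate = 'a' * count + 'b'
            if text.startswith(candidate, start):
                return start, candidate
    return None

print(leftmost_lazy('aaab', 1))      # (2, 'ab')
print(leftmost_lazy('aaab', None))   # (0, 'aaab')

# The brute-force result agrees with re.search for both patterns.
for pattern, max_a in (('a??b', 1), ('a*?b', None)):
    m = re.search(pattern, 'aaab')
    assert (m.start(), m.group()) == leftmost_lazy('aaab', max_a)
```

The outer loop over start positions is what makes the result leftmost; laziness only affects the inner loop.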
The point is that the search function returns the leftmost match. Using ?? and *? only changes the behaviour to prefer the shortest match at that leftmost starting position; it will not return a shorter match that starts to the right of a match that has already been found.
Also note that the re module doesn't return overlapping matches, so even using findall or finditer you will not be able to find the two matches you are looking for.
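A quick check of that claim, using only the patterns and the string from the question (the variable names are just for this example):

```python
import re

# Each call scans left to right and resumes after the end of the previous match,
# so the shorter candidates inside an already consumed match are never reported.
optional_matches = re.findall('a??b', 'aaab')
star_matches = re.findall('a*?b', 'aaab')
optional_spans = [m.span() for m in re.finditer('a??b', 'aaab')]

print(optional_matches)  # ['ab']
print(star_matches)      # ['aaab']
print(optional_spans)    # [(2, 4)]
```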
It's because ?? is lazy while ? is greedy. A lazy quantifier will match zero or one occurrence of the token to its left, preferring zero if that still allows the overall pattern to match. For example, all of the following return an empty string:
>>> print(re.search('a??','a').group())
>>> print(re.search('a??','aa').group())
>>> print(re.search('a??','aaaa').group())
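To see the contrast with the greedy ? on the same inputs, here is a small side-by-side comparison (the dictionary name results is just for this example):

```python
import re

results = {}
for text in ('a', 'aa', 'aaaa'):
    lazy = re.search('a??', text).group()    # prefers zero occurrences
    greedy = re.search('a?', text).group()   # prefers one occurrence
    results[text] = (lazy, greedy)
    print(repr(text), 'lazy:', repr(lazy), 'greedy:', repr(greedy))
# 'a' lazy: '' greedy: 'a'
# 'aa' lazy: '' greedy: 'a'
# 'aaaa' lazy: '' greedy: 'a'
```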
And the regex a??b will match ab or b:
>>> print(re.search('a??b','aaab').group())
ab
>>> print(re.search('a??b','aacb').group())
b
And if the overall pattern cannot match because there is no b, search returns None, so calling group() on the result raises an error:
>>> print(re.search('a??b','aac').group())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
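One way to avoid that exception is to check the result of search before calling group(). A minimal sketch, where first_match is a made-up helper name:

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m is not None else None

print(first_match('a??b', 'aac'))    # None
print(first_match('a??b', 'aacb'))   # b
```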
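As for the second part, a*? is also non-greedy, but the match still has to start at the leftmost possible position. From there it takes as few a as it can, which here means all of them, since b only appears after the third a. So it will match any number of a and then b: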
print(re.search('a*?b','aaab').group())
aaab
Explanation for the pattern /a??b/
a?? matches the character a literally (case sensitive). The quantifier ?? means between zero and one time, as few times as possible, expanding as needed [lazy]. Then the character b should match literally (case sensitive).
So it will match the last two characters, 'ab', of the given string 'aaab'.
And for the pattern /a*?b/
a*? matches the character 'a' literally (case sensitive). Here the quantifier *? means between zero and unlimited times, as few times as possible, expanding as needed [lazy]. Then the character b should match literally (case sensitive).
So it will match 'aaab' as a whole in 'aaab'.
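Both conclusions can be confirmed by printing where each match starts and ends. A short check, with spans as an illustrative variable name:

```python
import re

spans = {}
for pattern in ('a??b', 'a*?b'):
    m = re.search(pattern, 'aaab')
    spans[pattern] = (m.span(), m.group())
    print(pattern, m.span(), m.group())
# a??b (2, 4) ab
# a*?b (0, 4) aaab
```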