<table((?!</table>).)*</table>
matches all my table tags, however,
<table(.(?!</table>))*</table>
does not. The second one seems to make sense if I try to write out the expression in words, but I can't make sense of the first.
Can someone explain the difference to me?
For reference, I got the term "Tempered Greedy Token" from here: http://www.rexegg.com/regex-quantifiers.html#tempered_greed
`((?!</table>).)*` checks that the character about to be matched is not the first character of a `</table>` sequence. Only if that holds does it match the character. The `*` repeats this zero or more times.

`(.(?!</table>))*` matches any character only if it is not followed by `</table>`, zero or more times. So it matches all the characters inside the table tag except the last one, since the last character is followed by `</table>`. The pattern that comes next, `</table>`, asserts that there must be a closing table tag at that point, but the last character is still unmatched there, so the match fails.
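To make the difference concrete, here is a small sketch using Python's `re` module (my choice for illustration; the question does not say which regex flavor is in use, and the sample string is made up). It runs both patterns from the question against the same input and also shows where the consume-first group stops.

```python
import re

text = "<table>table</table>"  # made-up sample input

# Lookahead first, then consume: a character is taken only if
# "</table>" does not start at the current position.
lookahead_first = re.compile(r"<table((?!</table>).)*</table>")

# Consume first, then look ahead: a character is taken only if
# "</table>" does not start right after it.
consume_first = re.compile(r"<table(.(?!</table>))*</table>")

m1 = lookahead_first.search(text)
m2 = consume_first.search(text)

# Without the trailing "</table>", the consume-first group shows where it stops:
# it refuses the final "e" because that "e" is followed by "</table>".
stopped_at = re.match(r"<table(?:.(?!</table>))*", text).group(0)

print(m1.group(0) if m1 else None)  # <table>table</table>
print(m2)                           # None
print(stopped_at)                   # <table>tabl
```

The last line is the key point: the group halts one character early, so the closing `</table>` has nothing to match against except the leftover `e`.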
A tempered greedy token really just means:
how you do it:
how it works:
Wiktor's more detailed answer is nice, I just thought a simpler explanation was in order.
Since Google returns this SO question at the top of the results for "tempered greedy token", I feel obliged to provide a more comprehensive answer.

What is a Tempered Greedy Token?
The rexegg.com tempered greedy token reference is quite concise:
That is it: a tempered greedy token is a kind of a negated character class for a character sequence (cf. negated character class for a single character).
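A quick way to see the "negated character class for a sequence" idea is to compare the two side by side. This is only a sketch in Python's `re` module, with a made-up input string: the character class gives up at the first `<`, while the tempered token keeps going until the whole `</table>` sequence begins.

```python
import re

text = "<table><tr>cell</tr></table>"  # made-up sample input

# Negated character class: stops at the first "<", whatever follows it.
char_class = re.match(r"<table>([^<]*)", text)

# Tempered greedy token: stops only where "</table>" begins.
tempered = re.match(r"<table>((?:(?!</table>).)*)", text)

print(repr(char_class.group(1)))  # ''
print(repr(tempered.group(1)))    # '<tr>cell</tr>'
```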
NOTE: The difference between a tempered greedy token and a negated character class is that the former does not really match any text other than the sequence itself; it matches a single character that does not start that sequence. That is, `(?:(?!abc|xyz).)+` won't match just `def` in `defabc`: it will match `def` and `bc`, because `a` starts the forbidden `abc` sequence and `bc` does not.

It consists of:
- `(?:...)*` - a quantified non-capturing group (it may be a capturing group, but it makes no sense to capture each individual character); the `*` can be `+`, depending on whether an empty string match is expected
- `(?!...)` - a negative lookahead that actually imposes a restriction on the text to the right of the current location
- `.` (or any, usually single-character, pattern) - a consuming pattern

However, we can always further temper the token by using alternations in the negative lookahead (e.g. `(?!{(?:END|START|MID)})`) or by replacing the all-matching dot with a negated character class (e.g. `(?:(?!START|END|MID)[^<>])` when trying to match text only inside tags).

Consuming part placement
Note that there is no mention of a construction where the consuming part (the dot in the original tempered greedy token) is placed before the lookahead. Avinash's answer explains that part clearly: `(.(?!</table>))*` first matches any character (except a newline, unless a DOTALL modifier is used) and then checks that it is not followed by `</table>`, which results in a failure to match `e` in `<table>table</table>`. The consuming part (the `.`) MUST be placed after the tempering lookahead.

When to use tempered greedy token?
Rexegg.com gives an idea:
- `{START}(?:(?!{(?:MID|RESTART)}).)*?{END}`
- instead of `<table>.*?chair.*?</table>`, we'd use something like `<table>(?:(?!chair|</?table>).)*chair(?:(?!<table>).)*</table>`
- getting `abc 2 xyz` from `abc 1 abc 2 xyz` (compare `abc.*?xyz` and `abc(?:(?!abc).)*?xyz`)

Performance Issue
A tempered greedy token is resource-consuming because a lookahead check is performed at every position before the consuming pattern is allowed to match a character. The unroll-the-loop technique can significantly improve tempered greedy token performance.
Say we want to match `abc 2 xyz` in `abc 1 abc 2 xyz 3 xyz`. Instead of checking each character between `abc` and `xyz` with `abc(?:(?!abc|xyz).)*xyz`, we can skip all characters that are not `a` or `x` with `[^ax]*`, and then match every `a` that is not followed by `bc` (with `a(?!bc)`) and every `x` that is not followed by `yz` (with `x(?!yz)`): `abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz`.
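As a sanity check, the sketch below (Python's `re` module, chosen only for illustration) runs the lazy-dot pattern, the tempered token and the unrolled version against the sample string from the text. The second input, with a stray `a` and `x` between the delimiters, is my own addition to confirm that the unrolled pattern still agrees with the tempered one. It checks what is matched only; it does not measure speed.

```python
import re

text = "abc 1 abc 2 xyz 3 xyz"

patterns = {
    "lazy": r"abc.*?xyz",
    "tempered": r"abc(?:(?!abc|xyz).)*xyz",
    "unrolled": r"abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz",
}

# First match of each pattern in the sample string.
results = {name: re.search(p, text).group(0) for name, p in patterns.items()}

for name, matched in results.items():
    print(f"{name:9} {matched!r}")
# lazy      'abc 1 abc 2 xyz'
# tempered  'abc 2 xyz'
# unrolled  'abc 2 xyz'

# Extra input (an assumption of mine, not from the text): lone "a" and "x"
# characters between the delimiters must not break the unrolled pattern.
extra = "abc a box xyz"
extra_tempered = re.search(patterns["tempered"], extra).group(0)
extra_unrolled = re.search(patterns["unrolled"], extra).group(0)
print(extra_tempered == extra_unrolled)  # True
```

The lazy pattern starts at the first `abc` and overshoots, while the tempered and unrolled patterns both return the shortest window.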