I want to capture a run of digits that is not followed by certain digits. For example:
input = abcdef lookbehind 123456..... asjdnasdh lookbehind 789432
I want to capture 789432 and not 123 using negative lookahead only.
I tried (?<=lookbehind )([\d])+(?!456) but it captures 123456 and 789432.
Using (?<=lookbehind )([\d])+?(?!456) captures only 1 and 7.
Grouping is not an option for me as my use case doesn't allow me to do it.
Is there any way I can capture 789432 and not 123 using pure regex?
An explanation for the answer is appreciated.
You may use a possessive quantifier with a negative lookbehind:
(?<=lookbehind )\d++(?<!456)
                  ^^ ^^^^^^
See this regex demo.
A synonymous pattern with an atomic group:
(?<=lookbehind )(?>\d+)(?<!456)
Details
(?<=lookbehind ) - a positive lookbehind that matches a location in the string that is immediately preceded by "lookbehind "
\d++ - 1+ digits matched possessively, allowing no backtracking into the pattern (the engine cannot give back any of the digits matched with \d++)
(?<!456) - a negative lookbehind check that fails the match if the last 3 digits matched with \d++ are 456.
Why lookbehind and why not lookahead
The negative lookbehind (?<!...) makes sure that a certain pattern does not match immediately to the left of the current location. A negative lookahead (?!...) fails the match if its pattern matches immediately to the right of the current location. "Fail" here means that the regex engine abandons the current way of matching the string, and if there are quantified patterns before the lookbehind/lookahead, the engine may backtrack into those patterns to try to match the string differently. Note that here the possessive quantifier prevents the engine from performing the lookbehind check for 456 multiple times: the check is executed only once, after all the digits have been grabbed with \d++.
Your (?<=lookbehind )([\d])+(?!456) regex matches 123456 because \d+ matches these digits greedily (all at once) and (?!456) then checks for 456 after them; since there is no 456 there, the match is returned. The (?<=lookbehind )([\d])+?(?!456) regex matches only one digit because \d+? matches lazily: 1 digit is matched and then the lookahead check is performed. Since there is no 456 right after 1, 1 is returned.
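The behaviour of the two attempts from the question can be reproduced with a short Python sketch. The patterns are copied from the question as written; finditer with group(0) is used so that the whole match is reported rather than the capturing group, and the variable names are illustrative only.

```python
import re

# Input string from the question.
text = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"

# Greedy and lazy attempts, exactly as posted in the question.
greedy = re.compile(r"(?<=lookbehind )([\d])+(?!456)")
lazy = re.compile(r"(?<=lookbehind )([\d])+?(?!456)")

# group(0) is the full match; the capturing group only holds the last digit.
greedy_matches = [m.group(0) for m in greedy.finditer(text)]
lazy_matches = [m.group(0) for m in lazy.finditer(text)]

print(greedy_matches)  # ['123456', '789432']
print(lazy_matches)    # ['1', '7']
```

In both cases the lookahead inspects the text after the digits already consumed, so it never sees the 456 that is part of the run itself.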
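Why the ++ possessive quantifier
It does not allow the regex engine to backtrack into the digits it has already matched in order to retry the match differently. Without it, (?<=lookbehind )\d+(?<!456) matches 12345 in 123456: once the lookbehind fails at the end of 123456, \d+ gives back the 6, and the three characters before that position are 345, not 456.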
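To see the backtracking described above, here is a small Python sketch with the plain greedy \d+ in place of the possessive \d++. It needs nothing beyond the standard re module; the variable names are my own.

```python
import re

# Input string from the question.
text = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"

# Same pattern as the possessive one, but with an ordinary greedy quantifier.
backtracking = re.compile(r"(?<=lookbehind )\d+(?<!456)")

backtracking_matches = backtracking.findall(text)

# The engine gives back the 6, so the first run is shortened instead of rejected.
print(backtracking_matches)  # ['12345', '789432']
```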
You may use a negative lookbehind as well:
(?<=lookbehind )\d+\b(?<!456)
RegEx Demo
RegEx Details:
(?<=lookbehind ): Positive lookbehind to assert that we have "lookbehind " before the current position
\d+\b: Match 1+ digits followed by a word boundary
(?<!456): Negative lookbehind to assert that we don't have 456 before the current position
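A minimal Python check of this pattern against the input from the question (variable names are illustrative). The word boundary is what stops a shortened match here: after giving back the 6, the engine is between 5 and 6, which is not a word boundary.

```python
import re

# Input string from the question.
text = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"

# \b forces the digit run to end at a word boundary before the lookbehind runs.
boundary = re.compile(r"(?<=lookbehind )\d+\b(?<!456)")

boundary_matches = boundary.findall(text)

print(boundary_matches)  # ['789432']
```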
Alternative solution using a negative lookahead:
(?<=lookbehind )(?!\d*456)\d+
RegEx Demo 2
We need \d* in the lookahead expression (?!\d*456) so that the lookahead can skip 0 or more digits from the current position before checking for 456.
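Here is a small Python sketch of the lookahead variant. The second input, "lookbehind 14567", is a hypothetical string of my own (not from the question), added to show one difference from the \b version: because \d* can stop anywhere in the run, the lookahead rejects a run that contains 456 at any position, not only at the end.

```python
import re

# Input string from the question.
question_text = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"
# Hypothetical extra input: 456 sits in the middle of the digit run.
extra_text = "lookbehind 14567"

lookahead = re.compile(r"(?<=lookbehind )(?!\d*456)\d+")
boundary = re.compile(r"(?<=lookbehind )\d+\b(?<!456)")

lookahead_on_question = lookahead.findall(question_text)
lookahead_on_extra = lookahead.findall(extra_text)
boundary_on_extra = boundary.findall(extra_text)

print(lookahead_on_question)  # ['789432']
print(lookahead_on_extra)     # []
print(boundary_on_extra)      # ['14567']
```

Which behaviour is wanted depends on whether only a trailing 456 should disqualify the run; for the input in the question both patterns give the same result.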