Capture stream of digits which is not followed by

Posted 2019-06-07 05:05

Question:

I want to capture a stream of digits that is not followed by certain digits. For example:

input = abcdef lookbehind 123456..... asjdnasdh lookbehind 789432

I want to capture 789432 and not 123 using negative lookahead only.

I tried (?<=lookbehind )([\d])+(?!456) but it captures 123456 and 789432.

Using (?<=lookbehind )([\d])+?(?!456) captures only 1 and 7.

Grouping is not an option for me as my use case doesn't allow me to do it.

Is there any way I can capture 789432 and not 123 using pure regex? An explanation for the answer is appreciated.

Answer 1:

You may use a possessive quantifier with a negative lookbehind:

(?<=lookbehind )\d++(?<!456)
                  ^^ ^^^^^^ 

A synonymous pattern with an atomic group:

(?<=lookbehind )(?>\d+)(?<!456)

Details

  • (?<=lookbehind ) - a positive lookbehind that matches a position in the string that is immediately preceded by "lookbehind " (including the trailing space)
  • \d++ - 1+ digits matched possessively, allowing no backtracking into the pattern (the engine cannot give back any digit matched with \d++ and retry)
  • (?<!456) - a negative lookbehind that fails the match if the last 3 digits matched with \d++ are 456.
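
As a quick check of the pattern described above, here is a minimal Python sketch. It assumes Python's re module, which accepts \d++ only from version 3.11 onward; on older versions the sketch falls back to (?=(\d+))\1, a common emulation of an atomic group that is not part of this answer. TEXT is the sample input from the question, and compile_possessive is a hypothetical helper name.

```python
import re

# Sample input taken from the question.
TEXT = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"


def compile_possessive():
    """Return the possessive pattern, or an emulation on Python < 3.11."""
    try:
        return re.compile(r"(?<=lookbehind )\d++(?<!456)")
    except re.error:
        # Assumption: older `re` rejects `++`. A lookahead is atomic, so
        # (?=(\d+))\1 consumes the whole digit run without backtracking.
        return re.compile(r"(?<=lookbehind )(?=(\d+))\1(?<!456)")


# group(0) is the whole match, regardless of any capture group.
matches = [m.group(0) for m in compile_possessive().finditer(TEXT)]
print(matches)  # ['789432']
```

Only 789432 is returned: the run 123456 is consumed whole, the lookbehind sees 456 to its left, and the engine has no shorter run to fall back on.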

Why lookbehind and why not lookahead

A negative lookbehind (?<!...) makes sure that a certain pattern does not match immediately to the left of the current location. A negative lookahead (?!...) fails the match if its pattern matches immediately to the right of the current location. "Fail" here means that the regex engine abandons the current way of matching the string; if there are quantified patterns before the lookbehind/lookahead, the engine may backtrack into those patterns and try to match the string differently. Here, the possessive quantifier prevents the engine from performing the lookbehind check for 456 more than once: the check runs a single time, after all the digits have been grabbed with \d++.

Your (?<=lookbehind )([\d])+(?!456) regex matches 123456 because ([\d])+ matches these digits greedily (all at once) and (?!456) then checks for 456 after them; since there is no 456 at that point, the match is returned. The (?<=lookbehind )([\d])+?(?!456) regex matches only one digit because ([\d])+? matches lazily: 1 digit is matched and then the lookahead check is performed. Since the text after 1 is 23456, which does not start with 456, 1 is returned.
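
The behaviour of the two attempts from the question can be reproduced with a short Python sketch (assuming Python's re module; all_matches is a hypothetical helper that returns whole matches rather than the capture group):

```python
import re

# Sample input taken from the question.
TEXT = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"


def all_matches(pattern, text):
    """Return the full text of every match (group 0), ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Greedy: all digits are taken first, then (?!456) is checked after them.
greedy = all_matches(r"(?<=lookbehind )([\d])+(?!456)", TEXT)
# Lazy: one digit is taken, then (?!456) is checked right after it.
lazy = all_matches(r"(?<=lookbehind )([\d])+?(?!456)", TEXT)

print(greedy)  # ['123456', '789432']
print(lazy)    # ['1', '7']
```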

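Why the ++ possessive quantifier

It does not allow the regex engine to backtrack into the digits and retry the match with fewer of them. Without it, (?<=lookbehind )\d+(?<!456) matches 12345 in 123456: after the lookbehind fails at the end of 123456, the engine gives back the 6, and the three characters to the left of the new position are 345, not 456.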
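
To see the backtracking with an ordinary greedy quantifier, a minimal Python sketch (assuming Python's re module, with the sample input from the question):

```python
import re

# Sample input taken from the question.
TEXT = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"

# Plain greedy \d+ (not possessive): the engine may give digits back.
NON_POSSESSIVE = r"(?<=lookbehind )\d+(?<!456)"

backtracked = re.findall(NON_POSSESSIVE, TEXT)
print(backtracked)  # ['12345', '789432']
```

The first run is shortened to 12345 instead of being rejected, which is the partial match the possessive quantifier is meant to prevent.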



Answer 2:

You may use a negative lookbehind as well:

(?<=lookbehind )\d+\b(?<!456)

RegEx Details:

  • (?<=lookbehind ): Positive lookbehind to assert that we have "lookbehind " before current position
  • \d+\b: Match 1+ digits followed by a word boundary
  • (?<!456): Negative lookbehind to assert that we don't have 456 before current position
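
A small Python sketch of this pattern against the sample input from the question (assuming Python's re module):

```python
import re

# Sample input taken from the question.
TEXT = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"

# \b pins the end of the match to the end of the digit run, so the
# engine cannot shorten 123456 to 12345 to satisfy the lookbehind.
WORD_BOUNDARY = r"(?<=lookbehind )\d+\b(?<!456)"

boundary_matches = re.findall(WORD_BOUNDARY, TEXT)
print(boundary_matches)  # ['789432']
```

Note that \b only matches here when the digit run is followed by a non-word character or the end of the input, so digits immediately followed by a letter or underscore would not be matched at all.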

Alternative solution using a negative lookahead:

(?<=lookbehind )(?!\d*456)\d+

We need \d* in the lookahead expression (?!\d*456) so that it can find 456 after 0 or more digits from the current position; if 456 is found there, the match is rejected.
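
A minimal Python sketch of the lookahead variant (assuming Python's re module). TEXT is the sample input from the question; EXTRA is a hypothetical input, not from the question, added to show where the two patterns in this answer differ:

```python
import re

# Sample input taken from the question.
TEXT = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432"
# Hypothetical extra input: 456 sits at the start of the run, not the end.
EXTRA = "lookbehind 45678"

LOOKAHEAD = r"(?<=lookbehind )(?!\d*456)\d+"
LOOKBEHIND = r"(?<=lookbehind )\d+\b(?<!456)"

lookahead_matches = re.findall(LOOKAHEAD, TEXT)
extra_lookahead = re.findall(LOOKAHEAD, EXTRA)
extra_lookbehind = re.findall(LOOKBEHIND, EXTRA)

print(lookahead_matches)  # ['789432']
print(extra_lookahead)    # []
print(extra_lookbehind)   # ['45678']
```

Because \d* can stop anywhere inside the run, the lookahead version rejects any digit run that contains 456, while the lookbehind version rejects only runs that end with 456.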