Star vs. plus quantifier in the variable-width neg

Posted 2019-06-07 06:02

Question:

Silly question here... I'm trying to match whitespace inside a line while ignoring the leading spaces/tabs, and came up with these two regexes, but I can't figure out why only one of them works (C# regex engine):

(?<!^[ \t]*)[ \t]+       // regex 1. (with *)
(?<!^[ \t]+)[ \t]+       // regex 2. (with +)

Note the star vs. plus repetitions in the negative lookbehind. When matching these against "  word1 word2" (2 leading spaces):

⎵⎵word1⎵word2             
       ^                 // 1 match for regex 1. (*)

⎵⎵word1⎵word2             
^^     ^                 // 2 matches for regex 2. (+)
 ^     ^                 // why not match like this?

Why does only version 1. (star) work here and version 2. (plus) not match the second leading space?

I presume it's because the greedy + from the consuming [ \t]+ takes priority over the lookbehind's, but how can I reason about this so that I would expect it?

Answer 1:

In short:

The negative lookbehind just checks that the current position is not preceded by the lookbehind pattern, and the result of the check is either true (yes, go on matching) or false (stop processing the pattern at this position and try the next one). The check does not affect the regex index: the engine stays at the same location after performing it.

In the current expressions, the lookbehind pattern is checked first (as the pattern is parsed from left to right, not vice versa), and only if the lookbehind check returns true is the [ \t]+ pattern tried. At the start of the string, the first expression's negative lookbehind returns false because the lookbehind pattern finds a match (the start of string followed by zero spaces/tabs). The second expression's negative lookbehind returns true there because no start of string followed by 1 or more spaces/tabs precedes the beginning of the string.
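To see the two outcomes side by side, here is a minimal sketch that runs both expressions from the question against the same input. The class and helper names (QuantifierDemo, Find) are made up for illustration; only Regex.Matches, Match.Index and Match.Length come from .NET.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: returns "index:length" for every match of pattern in input.
    public static string[] Find(string pattern, string input) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => $"{m.Index}:{m.Length}")
             .ToArray();

    static void Main()
    {
        const string input = "  word1 word2"; // two leading spaces

        // Star version: the lookbehind fails at every leading position.
        Console.WriteLine(string.Join(", ", Find(@"(?<!^[ \t]*)[ \t]+", input)));
        // prints: 7:1

        // Plus version: the lookbehind passes at index 0, so the greedy
        // [ \t]+ consumes both leading spaces as a single match.
        Console.WriteLine(string.Join(", ", Find(@"(?<!^[ \t]+)[ \t]+", input)));
        // prints: 0:2, 7:1
    }
}
```

The plus version reports the two leading spaces as one match of length 2, which is the "^^" in the question's second diagram.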

Here is the logic behind the 2 expressions:

  • The lookbehind check is performed first. At the beginning of the string, (?<!^[ \t]*) asks whether the current position is preceded by the start of string (^) followed by zero or more spaces or tabs. It is: ^ followed by zero whitespace characters matches right at index 0, so the negative lookbehind returns false and the match attempt fails there. It is important to note that .NET evaluates a lookbehind by matching its pattern right to left from the current position, so here it looks backwards for zero or more spaces/tabs and then for the start of string. The second expression's lookbehind, (?<!^[ \t]+), returns true at index 0 because no space or tab precedes that position, so ^[ \t]+ cannot match there. The consuming [ \t]+ pattern then grabs the leading horizontal whitespace, which moves the regex index forward, and another match is found later in the string.

  • After failing at the beginning of the string, the first expression tries to match after the first space. However, (?<!^[ \t]*) returns false again because that position is preceded by the start of string followed by 1 space (the first one). The same story repeats at the position after the second space. The only spaces matched by the first expression, (?<!^[ \t]*)[ \t]+, are therefore those that are not part of the leading whitespace at the beginning of the string.
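The position-by-position reasoning above can be checked by running each lookbehind on its own, without the consuming [ \t]+. The following is only a sketch: LookbehindProbe and PassingPositions are hypothetical names, and it relies on Regex.Matches reporting one empty match at every position where a zero-width pattern succeeds.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LookbehindProbe
{
    // Hypothetical helper: the lookbehind consumes nothing, so every match is
    // empty and Match.Index is a position where the check returned true.
    public static string PassingPositions(string lookbehind, string input) =>
        string.Join(" ",
            Regex.Matches(input, lookbehind)
                 .Cast<Match>()
                 .Select(m => m.Index.ToString()));

    static void Main()
    {
        const string input = "  word1 word2"; // two leading spaces, length 13

        // Star: fails at positions 0, 1 and 2, passes from 3 onwards.
        Console.WriteLine(PassingPositions(@"(?<!^[ \t]*)", input));
        // prints: 3 4 5 6 7 8 9 10 11 12 13

        // Plus: additionally passes at position 0, which is where the
        // consuming [ \t]+ is then allowed to start.
        Console.WriteLine(PassingPositions(@"(?<!^[ \t]+)", input));
        // prints: 0 3 4 5 6 7 8 9 10 11 12 13
    }
}
```

The only difference between the two lists is position 0, and that single position is enough for the greedy [ \t]+ to take both leading spaces.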

Lookahead analogy

Check the analogous lookahead patterns: [ \t]+(?![ \t]+$) will find both whitespace chunks in "bb bb ", while [ \t]+(?![ \t]*$) will not match the one at the end of the string. The same logic applies. The * version allows [ \t]* to match an empty string, so once the trailing whitespace is consumed the end of string is found immediately, the negative lookahead returns false, and the match fails. When the + version consumes the trailing whitespace, the regex engine, now at the end of the string, cannot find 1 or more spaces/tabs followed by the end of string, so the negative lookahead returns true and the trailing whitespace is matched.
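The lookahead analogy can be tried the same way. This is again a sketch with made-up names (LookaheadDemo, Find), run on the "bb bb " string from the paragraph above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Hypothetical helper: returns "index:length" for every match of pattern in input.
    public static string[] Find(string pattern, string input) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => $"{m.Index}:{m.Length}")
             .ToArray();

    static void Main()
    {
        const string input = "bb bb "; // one inner space, one trailing space

        // Plus in the lookahead: nothing is left after the trailing space,
        // so [ \t]+$ cannot match and the negative lookahead passes.
        Console.WriteLine(string.Join(", ", Find(@"[ \t]+(?![ \t]+$)", input)));
        // prints: 2:1, 5:1

        // Star in the lookahead: [ \t]*$ matches the empty string at the end,
        // so the trailing space is rejected.
        Console.WriteLine(string.Join(", ", Find(@"[ \t]+(?![ \t]*$)", input)));
        // prints: 2:1
    }
}
```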