Capture a comma only if it is inside double quotes ("")

Posted 2020-05-09 19:13

Question:

(?!([^"]*"[^"]*")*[^"]*$),

This matches a comma only if it is between double quotes. Let's say the test string is

 1 2 3 4 , 5 6 7, 8 9 "10 11 12 , 13 14 15," 16,17,"18,19,"20 21,22

The commas matched are the one between 12 and 13 and the other commas inside quotes, like this:

  http://regex101.com/r/rX0dM7/1

Now if I change the regex to

(?!(.*?".*?")*[^"]*?$),

This matches only the commas at the end, around 18 and 19. Something like

  http://regex101.com/r/hL7uS1/1

Now the question is: why does [^"]*" behave so differently from .*?" here?

Secondly, what is the significance of [^"]*$? If I remove it, nothing gets matched.

Answer 1:

Conceptually, I think you are not looking at this the right way.

Lookarounds are evaluated relative to the current search position.

Your negative lookahead sits in front of the comma, so it is tested against text that starts with the comma itself, before the comma has been matched.

This is called overlap.

Lookaheads are usually placed after the subexpression they qualify: the subexpression is consumed,
the position advances, and then the assertion is checked.

Likewise, lookbehinds typically go before the subexpression they qualify.

So your regex is effectively this (the comma is not a quote, so moving it in front of the lookahead does not change what matches):

 ,
 (?!
      ( [^"]* " [^"]* " )*
      [^"]* $ 
 ) 
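
To check that moving the comma in front of the lookahead leaves the result unchanged for this pattern, here is a minimal Python sketch. It assumes Python's re module handles these constructs the same way as the engine behind the regex101 links; the names TEXT, ORIGINAL, REORDERED and comma_positions are made up for the illustration.

```python
import re

# Test string from the question.
TEXT = '1 2 3 4 , 5 6 7, 8 9 "10 11 12 , 13 14 15," 16,17,"18,19,"20 21,22'

# Lookahead written before the comma, as in the question.
ORIGINAL = re.compile(r'(?!([^"]*"[^"]*")*[^"]*$),')

# Comma first, then the lookahead, laid out in verbose (free-spacing) mode.
REORDERED = re.compile(r'''
    ,
    (?!
        ( [^"]* " [^"]* " )*
        [^"]* $
    )
''', re.VERBOSE)

def comma_positions(pattern):
    """Start offsets of every comma the pattern matches in TEXT."""
    return [m.start() for m in pattern.finditer(TEXT)]

print(comma_positions(ORIGINAL) == comma_positions(REORDERED))  # True
print(len(comma_positions(ORIGINAL)))                           # 4
```

Both forms pick out the same four commas, all of them inside a quoted section.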

Written this way, it is easy to see that once [^"]*$ is removed,
( [^"]* " [^"]* " )* matches at every point in the string,
because the * makes the group optional: it can match nothing at all.
The negative lookahead then fails everywhere, so no comma is matched.

If you were to change it to ( [^"]* " [^"]* " )+ then it would have
something concrete to negatively match against.
The $ was serving that purpose before.
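
The three variants can be compared with a short Python sketch. It is only an illustration run under Python's re module (assumed to agree with the original demos here); TEXT, count_matches and the pattern names are hypothetical.

```python
import re

# Test string from the question.
TEXT = '1 2 3 4 , 5 6 7, 8 9 "10 11 12 , 13 14 15," 16,17,"18,19,"20 21,22'

def count_matches(pattern):
    """Number of commas in TEXT that the pattern matches."""
    return sum(1 for _ in re.finditer(pattern, TEXT))

WITH_TAIL = r',(?!([^"]*"[^"]*")*[^"]*$)'   # the full lookahead
NO_TAIL = r',(?!([^"]*"[^"]*")*)'           # tail removed, group still optional
NO_TAIL_PLUS = r',(?!([^"]*"[^"]*")+)'      # tail removed, group required once

print(count_matches(WITH_TAIL))     # 4
print(count_matches(NO_TAIL))       # 0
print(count_matches(NO_TAIL_PLUS))  # 3
```

The + version matches again, but it selects the commas that have fewer than two quotes after them, which is not the same set the full pattern finds.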

Hope you have a better understanding now.



Answer 2:

  • [^"]*" matches any number of characters except quotes, followed by a quote.
  • .*?" matches any number of characters, including quotes, followed by a quote.

Now the ? in the second regex makes the * quantifier lazy, which means that it asks it nicely to match as few characters as possible to make the match happen. Therefore, in the string abc"def", both regexes will match the same text. So far, so good.
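
A quick way to confirm this on the sample string, sketched in Python (SAMPLE, negated and lazy are names chosen for the example):

```python
import re

SAMPLE = 'abc"def"'

# Both patterns are tried at the start of the string and stop at the first quote.
negated = re.match(r'[^"]*"', SAMPLE).group()
lazy = re.match(r'.*?"', SAMPLE).group()

print(negated)  # abc"
print(lazy)     # abc"
```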

The problem is that you've enclosed that regex in a negative lookahead assertion, which succeeds only if the regex inside it cannot possibly match. Since the dot may also match a quote if it has to, it will do so in order to make a match possible, and that causes the negative lookahead to fail unless exactly one quote is left in the string: a lone quote can never complete a .*?".*?" group, whereas two or more always can.
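
The effect can be observed with a small Python sketch. It assumes Python's re module behaves like the engine used in the question for these patterns, and the helper quotes_after and the constant names are invented for the illustration. For each comma a pattern matches, it reports how many quotes remain to the right of that comma.

```python
import re

# Test string from the question.
TEXT = '1 2 3 4 , 5 6 7, 8 9 "10 11 12 , 13 14 15," 16,17,"18,19,"20 21,22'

# Forced to reach the end of the string, the lazy dot walks across a quote;
# the negated class cannot.
print(re.fullmatch(r'.*?"', 'abc"def"') is not None)    # True
print(re.fullmatch(r'[^"]*"', 'abc"def"') is not None)  # False

NEGATED_CLASS = re.compile(r'(?!([^"]*"[^"]*")*[^"]*$),')
LAZY_DOT = re.compile(r'(?!(.*?".*?")*[^"]*?$),')

def quotes_after(pattern):
    """For each matched comma, count the quotes remaining to its right."""
    return [TEXT.count('"', m.end()) for m in pattern.finditer(TEXT)]

print(quotes_after(NEGATED_CLASS))  # [3, 3, 1, 1]
print(quotes_after(LAZY_DOT))       # [1, 1]
```

The negated-class version accepts any comma followed by an odd number of quotes, while the lazy-dot version only accepts commas with a single quote left, which is why it finds just the two commas in "18,19,".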

For your second question, [^"]*$ lets the lookahead run through any non-quote characters that follow the last quote and anchors it to the end of the string, so the whole remainder of the string is checked for balanced quotes.
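
As a rough illustration of what the [^"]* part of that tail contributes, the Python sketch below keeps the $ anchor but drops [^"]*. The names TEXT, WITH_TAIL, BARE_ANCHOR and count_matches are hypothetical, and Python's re module is assumed to behave like the original engine for these patterns.

```python
import re

# Test string from the question; it ends with non-quote text after the last quote.
TEXT = '1 2 3 4 , 5 6 7, 8 9 "10 11 12 , 13 14 15," 16,17,"18,19,"20 21,22'

WITH_TAIL = r'(?!([^"]*"[^"]*")*[^"]*$),'  # pattern from the question
BARE_ANCHOR = r'(?!([^"]*"[^"]*")*$),'     # same, but without [^"]* before $

def count_matches(pattern):
    """Number of commas in TEXT that the pattern matches."""
    return sum(1 for _ in re.finditer(pattern, TEXT))

print(count_matches(WITH_TAIL))    # 4
print(count_matches(BARE_ANCHOR))  # 9
print(TEXT.count(','))             # 9
```

Without [^"]*, the expression inside the lookahead can only finish directly after a closing quote, so it never reaches the end of this string; the negative lookahead then succeeds everywhere and every comma is matched.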