Regular Expressions, understanding lookbehind in c

Posted 2019-02-20 00:42

Question:

This is more a question of understanding than an actual problem. The situation is as follows: I have some decimal numbers (e.g. amounts of money) enclosed in quotation marks "".

Examples:

  1. "1,23"
  2. "12,23"
  3. "123,23"

Now I wanted to match the comma in those expressions. I built the following regex which works for me:

(?<=\"[0-9]|[0-9]{2})(,)(?=[0-9]{2}\")

The part I don't completely understand is the lookbehind in combination with the or operator "|". But let's break it down:

(
?<=             //Start of the lookbehind
\"              //Starting with an escaped quotation mark "
[0-9]           //Followed by a digit between 0 and 9

The problem was that the quotation mark is not always followed by just one digit, as you can see in examples 2 and 3. A range quantifier such as {1,3} did not work within the lookbehind, as I found out in another Stack Overflow question.

So I decided to use the or operator "|", as suggested here:

|[0-9]{2}       //Or followed by two digits between 0 and 9
)

The interesting part is that it also matches the comma in the third example, "123,23", and I don't really understand why. I also don't know why I don't have to repeat the opening quotation mark after the or operator "|". I thought the complete lookbehind up to the or operator would have to be repeated or modified, e.g.:

(?<=\"[0-9]|\"[0-9]{2})(,)(?=[0-9]{2}\")            //This however does not work at all

So in my understanding the corresponding regular expression to match all three examples should look like the following:

(?<=\"[0-9]|\"[0-9]{2}|\"[0-9]{3})(,)(?=[0-9]{2}\")

or at least (if someone can explain the missing \"):

(?<=\"[0-9]|[0-9]{2}|[0-9]{3})(,)(?=[0-9]{2}\")

I hope someone is able to help me understand the situation.

//Edit: In case it matters, I used this regex on a plain text file in the Sublime Text 3 editor to search for the comma and replace it.

Answer 1:

You are correct,

(?<=\"[0-9]|\"[0-9]{2}|\"[0-9]{3})(,)(?=[0-9]{2}\")

should be the right regex in this case.


As for why you "don't need the \" for two and three digits": you actually do need it.

(?<=\"[0-9]|[0-9]{2}|[0-9]{3})(,)(?=[0-9]{2}\")

will also match 12,23" and 123,23", i.e. numbers that have no opening quotation mark.
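A minimal Python sketch of that point follows. Python's `re` module is not the engine Sublime uses, so treat it only as an illustration under that assumption; the helper name `comma_positions` is made up for this example. It runs the asker's original two-alternative lookbehind, whose second alternative has no quotation mark, against the three examples and against an input without an opening quote.

```python
import re

# The asker's original pattern: the second alternative has no leading quote.
ORIGINAL = re.compile(r'(?<="[0-9]|[0-9]{2})(,)(?=[0-9]{2}")')

def comma_positions(text):
    """Return the index of every comma the pattern matches in text."""
    return [m.start() for m in ORIGINAL.finditer(text)]

# The last sample has no opening quotation mark, yet its comma still matches.
for sample in ['"1,23"', '"12,23"', '"123,23"', 'x12,23"']:
    print(sample, comma_positions(sample))
```

In each case the comma is found, because two digits directly before the comma are enough to satisfy the second alternative.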


EDIT: It looks like the problem is that Sublime doesn't allow a variable-length lookbehind, even when the alternatives are listed with |. This means (?<=\"[0-9]|\"[0-9]{2}|\"[0-9]{3}) will fail, because the alternatives differ in length: 2, 3 and 4 characters. Your original lookbehind works because both of its alternatives, \"[0-9] and [0-9]{2}, are exactly two characters long.

This is because Sublime appears to use the Boost regex library, whose documentation states:

Lookbehind

(?<=pattern) consumes zero characters, only if pattern could be matched against the characters preceding the current position (pattern must be of fixed length).

(?<!pattern) consumes zero characters, only if pattern could not be matched against the characters preceding the current position (pattern must be of fixed length).
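Python's built-in `re` module has a comparable fixed-width restriction on lookbehind, so it can serve as a rough stand-in to illustrate the rule. It is a different engine from Boost, and how an engine reports the problem may differ; the helper `compiles` is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Both alternatives are two characters wide: accepted.
print(compiles(r'(?<="[0-9]|[0-9]{2}),'))
# Alternatives of width 2 and 3: rejected.
print(compiles(r'(?<="[0-9]|"[0-9]{2}),'))
# A range quantifier makes the width variable: rejected.
print(compiles(r'(?<="[0-9]{1,3}),'))
```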

An alternative is to separate the lookbehinds:

(?:(?<=\"[0-9])|(?<=\"[0-9]{2})|(?<=\"[0-9]{3}))(,)(?=[0-9]{2}\")
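As a sketch of how the separated lookbehinds behave, here is the same pattern in Python's `re` module (again an assumption: not Sublime's engine, and the function name `replace_comma` is invented for the example). Each lookbehind has a fixed width on its own, and the alternation sits outside them.

```python
import re

# Three fixed-width lookbehinds (2, 3 and 4 characters) joined by alternation.
SPLIT = re.compile(
    r'(?:(?<="[0-9])|(?<="[0-9]{2})|(?<="[0-9]{3}))(,)(?=[0-9]{2}")'
)

def replace_comma(text, replacement="."):
    """Replace the decimal comma in quoted numbers with 1 to 3 integer digits."""
    return SPLIT.sub(replacement, text)

for sample in ['"1,23"', '"12,23"', '"123,23"', 'x12,23"']:
    print(sample, "->", replace_comma(sample))
```

Unlike the version without the repeated quotation mark, this one leaves an input such as x12,23" unchanged.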


What can you do if you don't want to list all possible lengths?

Some regex engines (including Perl's, Ruby's and Sublime's) offer a neat trick: \K. It roughly means "drop everything that was matched so far". With it, you can match any comma inside a number surrounded by quotation marks with:

"\d+\K,(?=\d+")

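Python's built-in `re` module does not support \K, so the sketch below swaps in a different technique that gives the same result for search-and-replace: capture the quotation mark and the integer digits in a group, then put them back in the replacement. The name `comma_to_dot` is made up for this example.

```python
import re

# Capture the opening quote and integer digits; the lookahead checks the rest.
QUOTED_NUMBER_COMMA = re.compile(r'("\d+),(?=\d+")')

def comma_to_dot(text):
    """Replace the decimal comma in quoted numbers of any length with a dot."""
    # \1 re-inserts the captured prefix, standing in for what \K would drop.
    return QUOTED_NUMBER_COMMA.sub(r'\1.', text)

print(comma_to_dot('a "1,23" b "12345,6"'))
```

With \K available, as in Sublime, the replacement would simply be the new separator, since the prefix is never part of the match.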