Strange behavior in regexes

Posted 2020-02-14 06:37

Question:

There was a question about regexes, and while trying to answer it I found another strange thing.

String x = "X";
System.out.println(x.replaceAll("X*", "Y"));

This prints YY. Why?

String x = "X";
System.out.println(x.replaceAll("X*?", "Y"));

And this prints YXY

Why doesn't the reluctant regex match the 'X' character? The string is effectively "nothing" X "nothing", so why does the first regex make two matches instead of matching all three pieces? And why does the second regex match only the "nothing"s and not the X?

Answer 1:

Let's consider them in turn:

"X".replaceAll("X*", "Y")

There are two matches:

  1. At character position 0, X is matched, and is replaced with Y.
  2. At character position 1, the empty string is matched, and Y gets added to the output.

End result: YY.

"X".replaceAll("X*?", "Y")

There are also two matches:

  1. At character position 0, the empty string is matched, and Y gets added to the output. The character at this position, X, was not consumed by the match, and is therefore copied into the output verbatim.
  2. At character position 1, the empty string is matched, and Y gets added to the output.

End result: YXY.
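To check the match positions described above, here is a minimal sketch that lists every match the regex engine reports. The class name MatchTrace and the helper trace are made up for illustration; the only library calls are Pattern.compile, Matcher.find, start, end and group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTrace {
    // Hypothetical helper: returns each match as "[start,end) 'text'"
    // in the order Matcher.find() reports them.
    static List<String> trace(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("[" + m.start() + "," + m.end() + ") '" + m.group() + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy: one match consuming X, then an empty match at the end.
        System.out.println(trace("X*", "X"));   // [[0,1) 'X', [1,1) '']
        // Reluctant: two empty matches, X is never consumed.
        System.out.println(trace("X*?", "X"));  // [[0,0) '', [1,1) '']
    }
}
```

replaceAll substitutes Y for each of these matches and copies everything between them unchanged, which gives YY and YXY respectively.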



Answer 2:

The * is a tricky 'quantifier' since it means '0 or more'. Thus, it also matches '0 times X' (i.e. an empty string).

I would use

"X".replaceAll("X+", "Y")

which has the expected behaviour.
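A small sketch of how X+ behaves on a few inputs, since it requires at least one X and so never produces an empty match. The class name PlusDemo, the helper replaceRuns and the sample inputs are assumptions for illustration only.

```java
class PlusDemo {
    // Hypothetical helper: replaces each run of one or more X with a single Y.
    static String replaceRuns(String input) {
        return input.replaceAll("X+", "Y");
    }

    public static void main(String[] args) {
        System.out.println(replaceRuns("X"));     // Y
        System.out.println(replaceRuns("aXXb"));  // aYb
        System.out.println(replaceRuns("abc"));   // abc (no X, nothing replaced)
    }
}
```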



Answer 3:

In your first example you are using a "greedy" quantifier. A greedy quantifier first consumes as much of the input as it can, so the first match attempted is the whole input. Since that attempt succeeds, the matcher continues from the end of the input and finds a zero-length match there, hence the two matches you see. The greedy matcher never backs off to the zero-length match before the X, because its first attempt already succeeded.

In the second example you are using a "reluctant" quantifier, which does the opposite of greedy. It starts at the beginning and consumes one character at a time, and only if it has to. So it first matches the zero-length string before the X; the matcher then moves forward by one character (which is why the X still appears in the output), and the next match is the zero-length string after the X.
There is a good tutorial here: http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
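The difference is easier to see on a slightly longer input. This is only a sketch: the class name QuantifierDemo, the helper replace and the input "aXXb" are made up for illustration.

```java
class QuantifierDemo {
    // Hypothetical helper: thin wrapper so both quantifiers run on the same input.
    static String replace(String regex, String input) {
        return input.replaceAll(regex, "Y");
    }

    public static void main(String[] args) {
        // Greedy: empty match before a, "XX" as one match,
        // empty match before b, empty match at the end.
        System.out.println(replace("X*", "aXXb"));   // YaYYbY
        // Reluctant: only empty matches, one at every position; no X is consumed.
        System.out.println(replace("X*?", "aXXb"));  // YaYXYXYbY
    }
}
```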