How does \\G work in .split?

Posted 2019-06-16 05:36

Question:

I like to do code-golfing in Java (even though Java is way too verbose to be competitive), which means completing a certain challenge in as few bytes as possible. In one of my answers I had the following piece of code:

for(var p:"A4;B8;CU;EM;EW;E3;G6;G9;I1;L7;NZ;O0;R2;S5".split(";"))

This basically loops over the 2-char Strings after the String has been converted into a String array with .split. Someone suggested I could golf it to this instead to save 4 bytes:

for(var p:"A4B8CUEMEWE3G6G9I1L7NZO0R2S5".split("(?<=\\G..)"))

The functionality is still the same. It loops over the 2-char Strings.

However, neither of us was 100% sure how this works, hence this question.


What I know:

I know .split("(?<= ... )") is used to split while keeping the trailing delimiter.
There is also a way to keep a leading delimiter, or to keep the delimiter as a separate item:

"a;b;c;d".split("(?<=;)")            // Results in ["a;", "b;", "c;", "d"]
"a;b;c;d".split("(?=;)")             // Results in ["a", ";b", ";c", ";d"]
"a;b;c;d".split("((?<=;)|(?=;))")    // Results in ["a", ";", "b", ";", "c", ";", "d"]

I know \G is used to stop after a non-match is encountered.
EDIT: \G is used to indicate the position where the last match ended (or the start of the string for the first run). Corrected definition thanks to @SebastianProske.

int count = 0;
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("match,");
java.util.regex.Matcher matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 5

count = 0;
pattern = java.util.regex.Pattern.compile("\\Gmatch,");
matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 3

But how does .split("(?<=\\G..)") work exactly when using \G inside the split?
And why does .split("(?=\\G..)") not work?

Here is a "Try it online" link for all the code snippets described above, to see them in action.

Answer 1:

how does .split("(?<=\\G..)") work

(?<=X) is a zero-width positive lookbehind for X. \G is the end of the previous match (not some kind of stop instruction) or beginning of input, and of course .. is two individual characters. So (?<=\G..) is a zero-width lookbehind for the end of the previous match plus two characters. Since this is split and we're describing a delimiter, making the entire thing a zero-width assertion means we only use it to identify where to break the string, not to actually consume any characters.

So let's walk through ABCDEF:

  1. \G matches beginning of input, and .. matches AB, so (?<=\G..) finds the zero-width space between AB and CD because this is a lookbehind: That is, the first point at which there is \G.. prior to the regex cursor is the point between AB and CD. So split between AB and CD.
  2. \G marks the location just after AB so (?<=\G..) finds the zero-width space between CD and EF, because as the regex cursor goes forward, that's the first place where \G.. matches: \G matching the location between AB and CD and .. matching CD. So split between CD and EF.
  3. Same again: \G marks the location just after CD so (?<=\G..) finds the zero-width space between EF and end-of-input. So split between EF and end-of-input.
  4. Create an array with all of the substrings between the matches except the empty one at the end (because this is split with an implicit limit = 0, which discards trailing empty strings).

Result { "AB", "CD", "EF" }.
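The walk-through above can be checked with a small sketch. The class and method names (LookbehindWalkthrough, splitPoints) are made up for illustration; the sketch only uses the standard java.util.regex API and assumes a reasonably recent JDK. It collects the zero-width match positions that split sees, then shows the effect of the limit argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LookbehindWalkthrough {
    // Collects the positions where (?<=\G..) matches; every match is zero-width.
    static List<Integer> splitPoints(String input) {
        Matcher m = Pattern.compile("(?<=\\G..)").matcher(input);
        List<Integer> points = new ArrayList<>();
        while (m.find()) {
            points.add(m.start()); // start() == end() for a zero-width match
        }
        return points;
    }

    public static void main(String[] args) {
        String input = "ABCDEF";
        // Positions after AB, after CD, and after EF (end of input).
        System.out.println(splitPoints(input));                              // [2, 4, 6]
        // Default limit 0: the trailing empty string is discarded.
        System.out.println(Arrays.toString(input.split("(?<=\\G..)")));     // [AB, CD, EF]
        // Negative limit: the trailing empty string from the match at index 6 is kept.
        System.out.println(Arrays.toString(input.split("(?<=\\G..)", -1))); // [AB, CD, EF, ]
    }
}
```

With an odd-length input such as "ABCDE", the last chunk should simply be the single leftover character.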

And why does .split("(?=\\G..)") not work?

Because (?=X) is a positive lookahead. The end of the previous match will never be ahead of the regex cursor. It can only be behind it.
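To see the lookahead variant fail in practice, here is a minimal sketch (the class name LookaheadAttempt and the method countMatches are invented for this example; Java 8 or later is assumed). The only place where \G.. can be found ahead of the cursor is the very start of the input, and split ignores a zero-width match there, so nothing gets split.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LookaheadAttempt {
    // Counts how often the lookahead variant (?=\G..) matches in the input.
    static int countMatches(String input) {
        Matcher m = Pattern.compile("(?=\\G..)").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "ABCDEF";
        // Only the zero-width match at index 0 is found; after that the cursor
        // has moved on, but \G still refers to index 0.
        System.out.println(countMatches(input));                        // 1
        // split skips a zero-width match at the beginning, so the input stays whole.
        System.out.println(Arrays.toString(input.split("(?=\\G..)"))); // [ABCDEF]
    }
}
```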



Answer 2:

First off, the definition of \G: it's an anchor which matches the beginning of the string or the end of the previous match. It's a position. It neither consumes a character nor changes the cursor position. Alan Moore wrote in an earlier answer that this behavior of \G inside lookbehinds is engine-specific. It splits at the desired length in Java but doesn't produce the same result in PCRE.

So how does \G in (?<=\G..) work? Look at the step-by-step demonstration below of where the dots and \G match:

 ↓A4
\G..↓B8
   \G..↓CU
      \G..
       .
       .

\G matches the beginning of the input string, then the dots match A and 4 in order. The engine continues traversing and stops right between 8 and C. Here the lookbehind matches:

A   4   B  8
     \G .  . (?<=\G..)

Where \G matches is where the previous dots ended matching, i.e. the position right after 4 and before B. This process continues to the end of the input string. It splits a string into chunks of 2 units of data (safely a character here). It shouldn't be relied on for multi-line input strings: it either splits only partially, since the dot . doesn't match a newline character, or doesn't split at all, since \G doesn't match the start of a line (only the start of the input string).
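The multi-line caveat can be illustrated with a short sketch. The class and method names are made up, and the (?s) (DOTALL) variant is an addition of this example rather than something suggested in the answer; it is shown only to contrast with the default behavior of the dot.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
class NewlineChunks {
    // Default dot: does not match line terminators such as \n.
    static String[] chunk(String input) {
        return input.split("(?<=\\G..)");
    }

    // Assumption for illustration: (?s) lets the dot match \n as well.
    static String[] chunkDotAll(String input) {
        return input.split("(?s)(?<=\\G..)");
    }

    public static void main(String[] args) {
        String input = "AB\nCD";
        // The newline is escaped in the output purely for readability.
        // Splitting stops after the first chunk, because ".." cannot match "\nC".
        System.out.println(Arrays.toString(chunk(input)).replace("\n", "\\n"));       // [AB, \nCD]
        // With DOTALL the newline counts as an ordinary character.
        System.out.println(Arrays.toString(chunkDotAll(input)).replace("\n", "\\n")); // [AB, \nC, D]
    }
}
```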

And why does .split("(?=\\G..)") not work?

Because of a lookahead's nature, which is to look forward, there is no possibility for it to reach the position where the previous match ended. It just keeps walking until the end.