Regex split into overlapping strings

Posted 2019-03-22 03:36

Question:

I'm exploring the power of regular expressions, so I'm just wondering if something like this is possible:

public class StringSplit {
    public static void main(String args[]) {
        System.out.println(
            java.util.Arrays.deepToString(
                "12345".split(INSERT_REGEX_HERE)
            )
        ); // prints "[12, 23, 34, 45]"
    }
}

If possible, then simply provide the regex (and preemptively some explanation on how it works).

If it's only possible in some regex flavors other than Java, then feel free to provide those as well.

If it's not possible, then please explain why.


BONUS QUESTION

Same question, but with a find() loop instead of split:

    Matcher m = Pattern.compile(BONUS_REGEX).matcher("12345");
    while (m.find()) {
        System.out.println(m.group());
    } // prints "12", "23", "34", "45"

Please note that it's not so much that I have a concrete task to accomplish one way or another, but rather I want to understand regular expressions. I don't need code that does what I want; I want regexes, if they exist, that I can use in the above code to accomplish the task (or regexes in other flavors that work with a "direct translation" of the code into another language).

And if they don't exist, I'd like a good solid explanation why.

Answer 1:

I don't think this is possible with split(), but with find() it's pretty simple. Just use a lookahead with a capturing group inside:

Matcher m = Pattern.compile("(?=(\\d\\d)).").matcher("12345");
while (m.find())
{
  System.out.println(m.group(1));
}

Many people don't realize that text captured inside a lookahead or lookbehind can be referenced after the match just like any other capture. It's especially counter-intuitive in this case, where the capture is a superset of the "whole" match.

As a matter of fact, it works even if the regex as a whole matches nothing. Remove the dot from the regex above ("(?=(\\d\\d))") and you'll get the same result. This is because, whenever a successful match consumes no characters, the regex engine automatically bumps ahead one position before trying to match again, to prevent infinite loops.
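
A minimal sketch to check the claim above, assuming a standard JDK. The class name LookaheadPairs and the helper pairs are illustrative names, not part of the original answer. It collects group(1) for both the consuming pattern and the purely zero-width one so the two results can be compared:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadPairs {
    // Collects capture group 1 from every match of regex in input.
    static List<String> pairs(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The lookahead captures two digits; the dot then consumes one character.
        System.out.println(pairs("(?=(\\d\\d)).", "12345"));
        // The lookahead alone: every match is empty, yet group 1 is still set.
        System.out.println(pairs("(?=(\\d\\d))", "12345"));
    }
}
```

Both calls should produce the same four pairs, since the engine advances one position after each empty match.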

There's no split() equivalent for this technique, though, at least not in Java. Although you can split on lookarounds and other zero-width assertions, there's no way to get the same character to appear in more than one of the resulting substrings.
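
To illustrate why split() cannot produce overlap, here is a small sketch (the class name SplitPartition and the pattern "(?<=.)(?=.)" are my own choices for the demonstration). It splits on a zero-width position and then joins the pieces back together; because split() only cuts the input, the pieces always reassemble into the original string:

```java
import java.util.Arrays;

class SplitPartition {
    // Splits at every zero-width position that has a character on both sides.
    static String[] pieces(String input) {
        return input.split("(?<=.)(?=.)");
    }

    public static void main(String[] args) {
        String[] parts = pieces("12345");
        System.out.println(Arrays.toString(parts));
        // Joining the pieces gives back the input: no character is shared between pieces.
        System.out.println(String.join("", parts).equals("12345"));
    }
}
```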



Answer 2:

This somewhat heavy implementation using Matcher.find instead of split will also work, although by the time you have to code a for loop for such a trivial task you might as well drop the regular expressions altogether and use substrings (for similar coding complexity minus the CPU cycles):

import java.util.*;
import java.util.regex.*;

public class StringSplit { 
    public static void main(String args[]) { 
        ArrayList<String> result = new ArrayList<String>();
        for (Matcher m = Pattern.compile("..").matcher("12345"); m.find(result.isEmpty() ? 0 : m.start() + 1); result.add(m.group()));
        System.out.println( result.toString() ); // prints "[12, 23, 34, 45]" 
    } 
} 
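
For comparison, the regex-free substring approach mentioned above might look like the following sketch (SubstringPairs and pairs are hypothetical names chosen for this example):

```java
import java.util.ArrayList;
import java.util.List;

class SubstringPairs {
    // Returns every window of two adjacent characters, left to right.
    static List<String> pairs(String input) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 2 <= input.length(); i++) {
            out.add(input.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("12345")); // prints "[12, 23, 34, 45]"
    }
}
```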

EDIT1

find(): the reason nobody so far has been able to concoct a regular expression like your BONUS_REGEX lies within Matcher, which resumes looking for the next match where the previous match ended (i.e. no overlap), as opposed to just after where the previous match started, unless you explicitly respecify the start position of the search (as above). A good candidate for BONUS_REGEX would have been "(.\\G.|^..)" but, unfortunately, the \G-anchor-in-the-middle trick doesn't work with Java's Matcher (though it works just fine in Perl):

 perl -e 'while ("12345"=~/(^..|.\G.)/g) { print "$1\n" }'
 12
 23
 34
 45

split(): as for INSERT_REGEX_HERE, a good candidate would have been (?<=..)(?=..) (the split point is the zero-width position where I have two characters to my right and two to my left), but again, because split conceives naught of overlap, you end up with [12, 3, 45] (which is close, but no cigar).
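
The near miss can be reproduced with a short sketch (the class name LookaroundSplit is made up for this example). Only positions 2 and 3 of "12345" have two characters on each side, so the string is cut there and the middle character ends up alone:

```java
import java.util.Arrays;

class LookaroundSplit {
    // Splits wherever two characters precede and two characters follow the position.
    static String[] split(String input) {
        return input.split("(?<=..)(?=..)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("12345"))); // prints "[12, 3, 45]"
    }
}
```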

EDIT2

For fun, you can trick split() into doing what you want by first doubling non-boundary characters (here you need a reserved character value to split around):

Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1#$1").split("#")

We can be smart and eliminate the need for a reserved character by taking advantage of the fact that zero-width look-ahead assertions (unlike look-behind) can have an unbounded length; we can therefore split around all points which are an even number of characters away from the end of the doubled string (and at least two characters away from its beginning), producing the same result as above:

Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1").split("(?<=..)(?=(..)*$)")

Alternatively, you can trick find() in a similar way (again without the need for a reserved character value):

Matcher m = Pattern.compile("..").matcher(
  Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1")
);
while (m.find()) { 
    System.out.println(m.group()); 
} // prints "12", "23", "34", "45" 
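
To make the doubling step easier to follow, this sketch prints the intermediate strings before splitting them. The class name DoublingTrick and its method names are illustrative only; the patterns are the ones used above:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DoublingTrick {
    // Matches each character that has a neighbour on both sides.
    static final Pattern INNER = Pattern.compile("((?<=.).(?=.))");

    // Doubles inner characters with a '#' between the copies, e.g. "2" becomes "2#2".
    static String doubledWithSeparator(String input) {
        return INNER.matcher(input).replaceAll("$1#$1");
    }

    // Doubles inner characters with nothing between the copies.
    static String doubledPlain(String input) {
        return INNER.matcher(input).replaceAll("$1$1");
    }

    public static void main(String[] args) {
        String withHash = doubledWithSeparator("12345");
        System.out.println(withHash); // prints "12#23#34#45"
        System.out.println(Arrays.toString(withHash.split("#")));

        String plain = doubledPlain("12345");
        System.out.println(plain); // prints "12233445"
        // Split at positions an even distance from the end, at least two characters in.
        System.out.println(Arrays.toString(plain.split("(?<=..)(?=(..)*$)")));
    }
}
```

Both split calls should yield the four overlapping pairs, because after doubling the pairs no longer overlap in the intermediate string.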


Answer 3:

Split chops a string into multiple pieces, but that doesn't allow for overlap. You'd need to use a loop to get overlapping pieces.



Answer 4:

I don't think you can do this with split() because it throws away the part that matches the regular expression.

In Perl this works:

my $string = '12345';
my @array = ();
while ( $string =~ s/(\d(\d))/$2/ ) {
    push(@array, $1);
}
print join(" ", @array);
# prints: 12 23 34 45

The find-and-replace expression says: match the first two adjacent digits and replace them in the string with just the second of the two digits.
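
A rough Java translation of the same idea, as a sketch only (the class name ShrinkingReplace is hypothetical, and this is my rendering rather than the answerer's code). Each pass records the first two adjacent digits and then replaces them with just the second digit, so the string shrinks by one character per iteration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShrinkingReplace {
    static List<String> pairs(String input) {
        Pattern p = Pattern.compile("(\\d(\\d))");
        List<String> out = new ArrayList<>();
        String s = input;
        Matcher m = p.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
            // Keep only the second digit of the first match, then rescan the shorter string.
            s = m.replaceFirst("$2");
            m = p.matcher(s);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", pairs("12345"))); // prints "12 23 34 45"
    }
}
```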



Answer 5:

An alternative using plain matching in Perl. It should work anywhere lookaheads do, and there is no need for a loop here.

 $_ = '12345';
 @list = /(?=(..))./g;
 print "@list";

 # Output:
 # 12 23 34 45

But this one, as posted earlier, is nicer if the \G trick works:

 $_ = '12345';
 @list = /^..|.\G./g;
 print "@list";

 # Output:
 # 12 23 34 45

Edit: Sorry, didn't see that all of this was posted already.