How to split a string, but also keep the delimiter

Posted 2019-01-01 09:47

I have a multiline string which is delimited by a set of different delimiters:

(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)

I can split this string into its parts using String.split, but it seems that I can't get the actual string that matched the delimiter regex.

In other words, this is what I get:

  • Text1
  • Text2
  • Text3
  • Text4

This is what I want:

  • Text1
  • DelimiterA
  • Text2
  • DelimiterC
  • Text3
  • DelimiterB
  • Text4

Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?

Tags: java
23 answers
零度萤火
Reply #2 · 2019-01-01 09:59

I don't think it is possible with String#split, but you can use a StringTokenizer. It won't let you define your delimiter as a regex, though, only as a set of single characters:

new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
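To make the one-liner above concrete, here is a minimal sketch that collects the tokens into a list. The class and method names (TokenizerDemo, tokensWithDelimiters) are made up for illustration; only StringTokenizer and its three-argument constructor come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Hypothetical helper: splits on any single character in delimiterChars
    // and keeps each delimiter character as its own token.
    static List<String> tokensWithDelimiters(String input, String delimiterChars) {
        // Third argument true = return the delimiters as tokens too.
        StringTokenizer tokenizer = new StringTokenizer(input, delimiterChars, true);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line, quoted, so whitespace is visible.
        for (String token : tokensWithDelimiters("Hello, world. Hi!", ",.!")) {
            System.out.println("\"" + token + "\"");
        }
    }
}
```

Note that each delimiter comes back as a single character, so a multi-character delimiter such as "::" would be returned as two separate tokens.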
皆成旧梦
Reply #3 · 2019-01-01 10:00

Here's a Groovy version based on some of the code above, in case it helps. It's short, anyway. It includes the head and tail only if they are not empty. The last part is a demo/test case.

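List splitWithTokens(str, pat) {
    def tokens = []
    def lastMatch = 0
    def m = str =~ pat
    while (m.find()) {
        // text between the previous delimiter and this one; compare against
        // lastMatch (not 0) so adjacent delimiters don't add empty strings
        if (m.start() > lastMatch) tokens << str[lastMatch..<m.start()]
        tokens << m.group()   // the delimiter itself
        lastMatch = m.end()
    }
    // trailing text after the last delimiter, if any
    if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
    tokens
}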

[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
 ['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each { 
   println splitWithTokens(*it)
}
妖精总统
Reply #4 · 2019-01-01 10:01

You can use lookahead and lookbehind, like this:

System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));

And you will get:

[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]

The last one is what you want.

((?<=;)|(?=;)) matches the empty string (a zero-width position) just before or just after a ;.

Hope this helps.

EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regular expressions. One thing I do to ease this is to create a constant whose name describes what the regex does, and use Java's String.format to fill in the delimiter. Like this:

static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
...
public void someMethod() {
...
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
...

This helps a little bit. :-D
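As a sketch of how that constant could be wrapped into a reusable method: the class and method names below (LookaroundSplit, splitKeeping) are hypothetical, and the use of Pattern.quote is my own assumption so that delimiters such as "." or "|" are treated as literal text rather than as regex syntax. Java's lookbehind needs a bounded length, which a quoted literal always has.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LookaroundSplit {
    // Zero-width position right after (lookbehind) or right before (lookahead) the delimiter.
    private static final String WITH_DELIMITER = "(?<=%1$s)|(?=%1$s)";

    // Hypothetical helper: literalDelimiter is matched as plain text, not as a regex.
    static String[] splitKeeping(String input, String literalDelimiter) {
        String quoted = Pattern.quote(literalDelimiter);
        return input.split(String.format(WITH_DELIMITER, quoted));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a;b;c;d", ";")));
        // prints [a, ;, b, ;, c, ;, d]
    }
}
```

If the delimiter itself has to be a regex, the quoting step would be dropped, and the pattern then has to stay within what Java accepts inside a lookbehind.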

皆成旧梦
Reply #5 · 2019-01-01 10:01

I got here late, but returning to the original question, why not just use lookarounds?

Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));

output:

[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]

EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:

{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }

I hope that's easier to read. Thanks for the heads-up, @finnw.

初与友歌
Reply #6 · 2019-01-01 10:03

If you are worried about the complications that lookahead/lookbehind might introduce, and just want a rock-solid utility method that can cope with any token pattern and any separators you throw at it (which is probably the case!), you can do without lookarounds altogether.

NB: I was surprised to find that the Apache Commons people don't seem to have provided this, e.g. in StringUtils.

I also suggest that this should be a flag in Pattern, i.e. INCLUDE_SEPARATORS.

But this is pretty simple if you use the Pattern and Matcher classes right:

    // NB could be a different spec for identifying tokens, of course!
    // DOTALL lets the separator group span line breaks in a multiline input.
    Pattern sepAndTokenPattern = Pattern.compile("(.*?)(\\w+)", Pattern.DOTALL);
    Matcher matcher = sepAndTokenPattern.matcher( stringForTokenising );
    List<String> tokenAndSeparatorList = new ArrayList<String>();

    // for most processing purposes you are going to want to know whether your 
    // combined list of tokens and separators begins with a token or separator        
    boolean startsWithToken = true;
    int matchEnd = -1;
    while (matcher.find()) {
        String preSep = matcher.group(1);
        if (!preSep.isEmpty()) {
            if( tokenAndSeparatorList.isEmpty() ){
                startsWithToken = false;
            }
            // in implementation you wouldn't want these | characters, of course 
            tokenAndSeparatorList.add("|" + preSep + "|"); // add sep
        }
        tokenAndSeparatorList.add("|" + matcher.group(2) + "|"); // add token
        matchEnd = matcher.end();
    }
    // get trailing separator, if there is one:
    if( matchEnd != -1 ){
        String trailingSep = stringForTokenising.substring( matchEnd );
        if( ! trailingSep.isEmpty() ){
            tokenAndSeparatorList.add( "|" + trailingSep + "|" );
        }
    }

    System.out.println(String.format("# starts with token? %b - matchList %s", startsWithToken, tokenAndSeparatorList));
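The same Pattern/Matcher idea can also be driven by the delimiter regex rather than by a token regex, which is closer to what the question asks for. The following is only a sketch under that assumption; DelimiterSplitter and splitKeepingDelimiters are made-up names, and empty text segments and empty matches are deliberately skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterSplitter {
    // Hypothetical helper: returns text segments and matched delimiters in input order.
    static List<String> splitKeepingDelimiters(String input, Pattern delimiter) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = delimiter.matcher(input);
        int last = 0; // end of the previous delimiter
        while (matcher.find()) {
            if (matcher.end() == matcher.start()) {
                continue; // ignore zero-width matches
            }
            if (matcher.start() > last) {
                parts.add(input.substring(last, matcher.start())); // text before this delimiter
            }
            parts.add(matcher.group()); // the delimiter exactly as it matched
            last = matcher.end();
        }
        if (last < input.length()) {
            parts.add(input.substring(last)); // trailing text
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingDelimiters("boo:and:foo", Pattern.compile(":")));
        // prints [boo, :, and, :, foo]
    }
}
```

Because the delimiter is an ordinary regex here, there is no restriction on its length, unlike with lookbehind.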
怪性笑人.
Reply #7 · 2019-01-01 10:04

A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, along the lines of (assuming a comma as the delimiter):

String marked = fullString.replace(",", "~,~");

Here you can replace the tilde (~) with any unique delimiter that does not occur in the text.

If you then split on the new delimiter, I believe you will get the desired result.
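A minimal sketch of this replace-then-split idea in Java follows. The names SentinelSplit and splitKeeping are hypothetical, and the check that the sentinel does not already occur in the input is my own addition, since the trick silently breaks otherwise.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SentinelSplit {
    // Hypothetical helper: wraps each delimiter in a sentinel, then splits on the sentinel.
    static String[] splitKeeping(String input, String delimiter, String sentinel) {
        if (input.contains(sentinel)) {
            throw new IllegalArgumentException("sentinel occurs in input: " + sentinel);
        }
        String marked = input.replace(delimiter, sentinel + delimiter + sentinel);
        // String.split takes a regex, so quote the sentinel to keep it literal.
        return marked.split(Pattern.quote(sentinel));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a,b,c", ",", "~")));
        // prints [a, ,, b, ,, c]
    }
}
```

One caveat with this approach: two adjacent delimiters, or a delimiter at the very start of the input, leave an empty string in the result.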
