How to split a string, but also keep the delimiter

2018-12-31 02:20发布

I have a multiline string which is delimited by a set of different delimiters:

(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)

I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.

In other words, this is what I get:

  • Text1
  • Text2
  • Text3
  • Text4

This is what I want

  • Text1
  • DelimiterA
  • Text2
  • DelimiterC
  • Text3
  • DelimiterB
  • Text4

Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?

标签: java
23条回答
与君花间醉酒
2楼-- · 2018-12-31 02:44

Here is a simple clean implementation which is consistent with Pattern#split and works with variable length patterns, which look behind cannot support, and it is easier to use. It is similar to the solution provided by @cletus.

public static String[] split(CharSequence input, String pattern) {
    return split(input, Pattern.compile(pattern));
}

public static String[] split(CharSequence input, Pattern pattern) {
    Matcher matcher = pattern.matcher(input);
    int start = 0;
    List<String> result = new ArrayList<>();
    while (matcher.find()) {
        result.add(input.subSequence(start, matcher.start()).toString());
        result.add(matcher.group());
        start = matcher.end();
    }
    if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
    return result.toArray(new String[0]);
}

I don't do null checks here, Pattern#split doesn't, why should I. I don't like the if at the end but it is required for consistency with the Pattern#split . Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.

I convert to String[] for consistency with Pattern#split, I use new String[0] rather than new String[result.size()], see here for why.

Here are my tests:

@Test
public void splitsVariableLengthPattern() {
    String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}

@Test
public void splitsEndingWithPattern() {
    String[] result = Split.split("/foo/$bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}

@Test
public void splitsStartingWithPattern() {
    String[] result = Split.split("$foo/bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}

@Test
public void splitsNoMatchesPattern() {
    String[] result = Split.split("/foo/bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}
查看更多
姐姐魅力值爆表
3楼-- · 2018-12-31 02:44

If you can afford, use Java's replace(CharSequence target, CharSequence replacement) method and fill in another delimiter to split with. Example: I want to split the string "boo:and:foo" and keep ':' at its righthand String.

String str = "boo:and:foo";
str = str.replace(":","newdelimiter:");
String[] tokens = str.split("newdelimiter");

Important note: This only works if you have no further "newdelimiter" in your String! Thus, it is not a general solution. But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.

查看更多
余欢
4楼-- · 2018-12-31 02:45

You want to use lookarounds, and split on zero-width matches. Here are some examples:

public class SplitNDump {
    static void dump(String[] arr) {
        for (String s : arr) {
            System.out.format("[%s]", s);
        }
        System.out.println();
    }
    public static void main(String[] args) {
        dump("1,234,567,890".split(","));
        // "[1][234][567][890]"
        dump("1,234,567,890".split("(?=,)"));   
        // "[1][,234][,567][,890]"
        dump("1,234,567,890".split("(?<=,)"));  
        // "[1,][234,][567,][890]"
        dump("1,234,567,890".split("(?<=,)|(?=,)"));
        // "[1][,][234][,][567][,][890]"

        dump(":a:bb::c:".split("(?=:)|(?<=:)"));
        // "[][:][a][:][bb][:][:][c][:]"
        dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
        // "[:][a][:][bb][:][:][c][:]"
        dump(":::a::::b  b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
        // "[:::][a][::::][b  b][::][c][:]"
        dump("a,bb:::c  d..e".split("(?!^)\\b"));
        // "[a][,][bb][:::][c][  ][d][..][e]"

        dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
        // "[Array][Index][Out][Of][Bounds][Exception]"
        dump("1234567890".split("(?<=\\G.{4})"));   
        // "[1234][5678][90]"

        // Split at the end of each run of letter
        dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
        // "[Booo][yaaaa][h! Yipp][ieeee][!!]"
    }
}

And yes, that is triply-nested assertion there in the last pattern.

Related questions

See also

查看更多
一个人的天荒地老
5楼-- · 2018-12-31 02:46

Another candidate solution using a regex. Retains token order, correctly matches multiple tokens of the same type in a row. The downside is that the regex is kind of nasty.

package javaapplication2;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaApplication2 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";

        // Terrifying regex:
        //  (a)|(b)|(c) match a or b or c
        // where
        //   (a) is one or more digits optionally followed by a decimal point
        //       followed by one or more digits: (\d+(\.\d+)?)
        //   (b) is one of the set + * / - occurring once: ([+*/-])
        //   (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
        Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
        Matcher tokenMatcher = tokenPattern.matcher(num);

        List<String> tokens = new ArrayList<>();

        while (!tokenMatcher.hitEnd()) {
            if (tokenMatcher.find()) {
                tokens.add(tokenMatcher.group());
            } else {
                // report error
                break;
            }
        }

        System.out.println(tokens);
    }
}

Sample output:

[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]
查看更多
其实,你不懂
6楼-- · 2018-12-31 02:47

If you are worried about the complications look-ahead/look-behind stuff might introduce, and just want a rock-solid utility method that can cope with any token pattern and any separators you throw at it. (Which is probably the case!)

NB surprised to find that the Apache Commons people don't seem to have provided this, e.g. in StringUtils.

Also I suggest that this should be a flag in Pattern: i..e INCLUDE_SEPARATORS.

But this is pretty simple if you use the Pattern and Matcher classes right:

    // NB could be a different spec for identifying tokens, of course!
    Pattern sepAndTokenPattern = Pattern.compile("(.*?)(\\w+)");
    Matcher matcher = sepAndTokenPattern.matcher( stringForTokenising );
    List<String> tokenAndSeparatorList = new ArrayList<String>();

    // for most processing purposes you are going to want to know whether your 
    // combined list of tokens and separators begins with a token or separator        
    boolean startsWithToken = true;
    int matchEnd = -1;
    while (matcher.find()) {
        String preSep = matcher.group(1);
        if (!preSep.isEmpty()) {
            if( tokenAndSeparatorList.isEmpty() ){
                startsWithToken = false;
            }
            // in implementation you wouldn't want these | characters, of course 
            tokenAndSeparatorList.add("|" + preSep + "|"); // add sep
        }
        tokenAndSeparatorList.add("|" + matcher.group(2) + "|"); // add token
        matchEnd = matcher.end();
    }
    // get trailing separator, if there is one:
    if( matchEnd != -1 ){
        String trailingSep = stringForTokenising.substring( matchEnd );
        if( ! trailingSep.isEmpty() ){
            tokenAndSeparatorList.add( "|" + trailingSep + "|" );
        }
    }

    System.out.println(String.format("# starts with token? %b - matchList %s", startsWithToken, tokenAndSeparatorList));
查看更多
登录 后发表回答