How to split a string, but also keep the delimiter (answers, page 2)

I have a multiline string which is delimited by a set of different delimiters:

(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)

I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.

In other words, this is what I get:

Text1
Text2
Text3
Text4

This is what I want

Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4

Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?

Tags: java

23 answers

零度萤火

#2 · 2019-01-01 09:59

I don't think it is possible with String#split, but you can use a StringTokenizer. It won't let you define your delimiter as a regex, though, only as a set of single characters:

new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
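To see what the tokenizer actually returns, here is a minimal sketch that collects the tokens into a list. The class and method names (TokenizerDemo, splitKeepingDelims) are made up for illustration; only StringTokenizer itself comes from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of the original answer.
class TokenizerDemo {
    // Splits on any single character contained in delims and keeps
    // each delimiter character as its own token (returnDelims = true).
    static List<String> splitKeepingDelims(String input, String delims) {
        StringTokenizer tokenizer = new StringTokenizer(input, delims, true);
        List<String> parts = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            parts.add(tokenizer.nextToken());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Tokens, in order: "Hello", ",", " world", ".", " Hi", "!"
        for (String part : splitKeepingDelims("Hello, world. Hi!", ",.!")) {
            System.out.println("\"" + part + "\"");
        }
    }
}
```

Note that the spaces stay attached to the text tokens, because only the characters listed in delims count as delimiters.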

皆成旧梦

#3 · 2019-01-01 10:00

Here's a Groovy version based on some of the code above, in case it helps. It's short, anyway. It conditionally includes the head and tail (if they are not empty). The last part is a demo/test case.

List splitWithTokens(str, pat) {
    def tokens=[]
    def lastMatch=0
    def m = str=~pat
    while (m.find()) {
      if (m.start() > 0) tokens << str[lastMatch..<m.start()]
      tokens << m.group()
      lastMatch=m.end()
    }
    if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
    tokens
}

[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
 ['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each { 
   println splitWithTokens(*it)
}
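Since the question asks for a JDK way, the same Matcher loop can be sketched in plain Java. This is an illustrative port, not the answer author's code: the class name SplitWithTokens is invented, and unlike the Groovy version this sketch skips every empty piece of text, including the empty strings between two adjacent delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java port of the Groovy splitWithTokens above.
class SplitWithTokens {
    // Returns the text pieces and the matched delimiters in document order.
    static List<String> splitWithTokens(String input, Pattern delimiter) {
        List<String> tokens = new ArrayList<>();
        int lastMatch = 0;
        Matcher m = delimiter.matcher(input);
        while (m.find()) {
            // Text between the previous delimiter and this one, if any.
            if (m.start() > lastMatch) {
                tokens.add(input.substring(lastMatch, m.start()));
            }
            tokens.add(m.group()); // the delimiter itself
            lastMatch = m.end();
        }
        // Trailing text after the last delimiter, if any.
        if (lastMatch < input.length()) {
            tokens.add(input.substring(lastMatch));
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern tag = Pattern.compile("<[^>]+>");
        System.out.println(splitWithTokens(
                "before<html><head><title>this is the title</title></head>after", tag));
    }
}
```

Because the delimiter is found with Matcher.find() rather than with lookarounds, it can be any regex, including ones of unbounded length.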

妖精总统

#4 · 2019-01-01 10:01

You can use lookahead and lookbehind, like this:

System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));

And you will get:

[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]

The last one is what you want.

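((?<=;)|(?=;)) matches the empty string (a zero-width position) just after a ; or just before a ;, so the split happens on both sides of each delimiter and the delimiter itself is never consumed.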

Hope this helps.

EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regexes. One thing I do to ease this is to create a constant whose name says what the regex does, and use Java's String.format to fill in the delimiter. Like this:

static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
...
public void someMethod() {
...
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
...

This helps a little bit. :-D
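One caveat with the format-string approach: the delimiter is pasted into the regex as-is, so a delimiter such as "." or "|" would be read as regex syntax. A minimal sketch of one way to guard against that, assuming the delimiter is meant as literal text, is to run it through Pattern.quote first. The class and method names (LookaroundSplit, splitKeeping) are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical wrapper around the WITH_DELIMITER idea above.
class LookaroundSplit {
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Splits input around a literal delimiter and keeps the delimiter
    // as its own array element.
    static String[] splitKeeping(String input, String literalDelimiter) {
        // Pattern.quote makes regex metacharacters in the delimiter literal.
        String quoted = Pattern.quote(literalDelimiter);
        return input.split(String.format(WITH_DELIMITER, quoted));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a.b.c", ".")));
        System.out.println(Arrays.toString(splitKeeping("a::b", "::")));
    }
}
```

If the delimiter really is a regex, keep in mind that Java's lookbehind generally needs a pattern of bounded length, so something like \s+ may be rejected inside (?<=...).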

皆成旧梦

#5 · 2019-01-01 10:01

I got here late, but returning to the original question, why not just use lookarounds?

Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));

output:

[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]

EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:

{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }

I hope that's easier to read. Thanks for the heads-up, @finnw.

初与友歌

#6 · 2019-01-01 10:03

You may be worried about the complications that look-ahead/look-behind might introduce, and just want a rock-solid utility method that can cope with any token pattern and any separators you throw at it (which is probably the case!).

NB: I was surprised to find that the Apache Commons people don't seem to have provided this, e.g. in StringUtils.

I also suggest that this should be a flag in Pattern, i.e. INCLUDE_SEPARATORS.

But this is pretty simple if you use the Pattern and Matcher classes right:

    // NB could be a different spec for identifying tokens, of course!
    Pattern sepAndTokenPattern = Pattern.compile("(.*?)(\\w+)");
    Matcher matcher = sepAndTokenPattern.matcher( stringForTokenising );
    List<String> tokenAndSeparatorList = new ArrayList<String>();

    // for most processing purposes you are going to want to know whether your 
    // combined list of tokens and separators begins with a token or separator        
    boolean startsWithToken = true;
    int matchEnd = -1;
    while (matcher.find()) {
        String preSep = matcher.group(1);
        if (!preSep.isEmpty()) {
            if( tokenAndSeparatorList.isEmpty() ){
                startsWithToken = false;
            }
            // in implementation you wouldn't want these | characters, of course 
            tokenAndSeparatorList.add("|" + preSep + "|"); // add sep
        }
        tokenAndSeparatorList.add("|" + matcher.group(2) + "|"); // add token
        matchEnd = matcher.end();
    }
    // get trailing separator, if there is one:
    if( matchEnd != -1 ){
        String trailingSep = stringForTokenising.substring( matchEnd );
        if( ! trailingSep.isEmpty() ){
            tokenAndSeparatorList.add( "|" + trailingSep + "|" );
        }
    }

    System.out.println(String.format("# starts with token? %b - matchList %s", startsWithToken, tokenAndSeparatorList));

怪性笑人.

#7 · 2019-01-01 10:04

A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, along these lines (assuming a comma as the delimiter):

fullString.replace(",", "~,~")

Here you can replace the tilde (~) with any marker that is guaranteed not to occur in the text.

Then if you do a split on your new delimiter, I believe you will get the desired result.
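A minimal sketch of this replace-then-split idea in Java follows. The class and method names (MarkerSplit, splitKeeping) and the check that the marker is unused are assumptions added for illustration, not part of the answer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the marker approach.
class MarkerSplit {
    // Wraps every literal delimiter in the marker, then splits on the marker.
    static String[] splitKeeping(String input, String delimiter, String marker) {
        if (input.contains(marker)) {
            throw new IllegalArgumentException("marker already occurs in input");
        }
        String marked = input.replace(delimiter, marker + delimiter + marker);
        // Quote the marker so it is treated as literal text by split().
        return marked.split(Pattern.quote(marker));
    }

    public static void main(String[] args) {
        // "a,b,c" becomes "a~,~b~,~c", which splits into a / , / b / , / c
        System.out.println(Arrays.toString(splitKeeping("a,b,c", ",", "~")));
    }
}
```

This only handles literal delimiters, and a leading delimiter or two adjacent delimiters will produce empty strings in the result, so it is best suited to simple inputs.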
