How to split a string, but also keep the delimiter

Posted 2019-01-01 09:47

I have a multiline string which is delimited by a set of different delimiters:

(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)

I can split this string into its parts using String.split, but it seems that I can't get the actual strings that matched the delimiter regex.

In other words, this is what I get:

  • Text1
  • Text2
  • Text3
  • Text4

This is what I want

  • Text1
  • DelimiterA
  • Text2
  • DelimiterC
  • Text3
  • DelimiterB
  • Text4

Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?

Tags: java
23 answers
不流泪的眼
#2 · 2019-01-01 10:08

I tweaked Pattern.split() so that each matched delimiter is also added to the result list.

Added

// add match to the list
matchList.add(input.subSequence(start, end).toString());

Full source

public static String[] inclusiveSplit(String input, String re, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<String>();

    Pattern pattern = Pattern.compile(re);
    Matcher m = pattern.matcher(input);

    // Add the segment before each match, followed by the match itself
    while (m.find()) {
        // Each match adds two elements, and one slot must stay free for the
        // remainder; otherwise an even limit would silently drop the tail
        if (matchLimited && matchList.size() >= limit - 2)
            break;
        int start = m.start();
        int end = m.end();
        matchList.add(input.subSequence(index, start).toString());
        // add match to the list
        matchList.add(input.subSequence(start, end).toString());
        index = end;
    }

    // If no match was consumed, return the input unchanged
    if (matchList.isEmpty())
        return new String[] { input };

    // Add remaining segment (everything after the last consumed match)
    matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize - 1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
一个人的天荒地老
#3 · 2019-01-01 10:09

I'll post my working versions as well (the first is very similar to Markus's).

public static String[] splitIncludeDelimeter(String regex, String text){
    List<String> list = new LinkedList<>();
    Matcher matcher = Pattern.compile(regex).matcher(text);

    int now, old = 0;
    while(matcher.find()){
        now = matcher.end();
        list.add(text.substring(old, now));
        old = now;
    }

    if(list.size() == 0)
        return new String[]{text};

    //adding rest of a text as last element
    String finalElement = text.substring(old);
    list.add(finalElement);

    return list.toArray(new String[list.size()]);
}

And here is a second solution, which is around 50% faster than the first one:

public static String[] splitIncludeDelimeter2(String regex, String text){
    List<String> list = new LinkedList<>();
    Matcher matcher = Pattern.compile(regex).matcher(text);

    StringBuffer stringBuffer = new StringBuffer();
    while(matcher.find()){
        // quoteReplacement keeps '$' and '\' in the match from being read as group references
        matcher.appendReplacement(stringBuffer, Matcher.quoteReplacement(matcher.group()));
        list.add(stringBuffer.toString());
        stringBuffer.setLength(0); //clear buffer
    }

    matcher.appendTail(stringBuffer); // append the rest of the string
    list.add(stringBuffer.toString());

    return list.toArray(new String[list.size()]);
}
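
Note that both versions keep each delimiter attached to the end of the preceding piece rather than returning it as a separate element. A minimal sketch of that behaviour, using the same loop as the first version (the class name, method name and sample input are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttachedDelimiterDemo {
    // Cuts the text right after every match, so each piece ends with its delimiter.
    static List<String> splitKeepAttached(String regex, String text) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int old = 0;
        while (matcher.find()) {
            parts.add(text.substring(old, matcher.end()));
            old = matcher.end();
        }
        parts.add(text.substring(old)); // whatever follows the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepAttached("[;,]", "a;b,c")); // prints [a;, b,, c]
    }
}
```

If the delimiters are needed as separate elements, as in the question, the pieces would have to be cut at matcher.start() as well.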
何处买醉
#4 · 2019-01-01 10:10

Here is a simple, clean implementation that is consistent with Pattern#split and works with variable-length patterns, which lookbehind cannot support. It is also easier to use. It is similar to the solution provided by @cletus.

public static String[] split(CharSequence input, String pattern) {
    return split(input, Pattern.compile(pattern));
}

public static String[] split(CharSequence input, Pattern pattern) {
    Matcher matcher = pattern.matcher(input);
    int start = 0;
    List<String> result = new ArrayList<>();
    while (matcher.find()) {
        result.add(input.subSequence(start, matcher.start()).toString());
        result.add(matcher.group());
        start = matcher.end();
    }
    if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
    return result.toArray(new String[0]);
}

I don't do null checks here; Pattern#split doesn't, so why should I? I don't like the if at the end, but it is required for consistency with Pattern#split. Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.

I convert to String[] for consistency with Pattern#split. I use new String[0] rather than new String[result.size()]; see here for why.

Here are my tests:

@Test
public void splitsVariableLengthPattern() {
    String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}

@Test
public void splitsEndingWithPattern() {
    String[] result = Split.split("/foo/$bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}

@Test
public void splitsStartingWithPattern() {
    String[] result = Split.split("$foo/bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}

@Test
public void splitsNoMatchesPattern() {
    String[] result = Split.split("/foo/bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}
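
For comparison, the lookaround approach that this answer contrasts itself with can be sketched for the simple case of single-character delimiters, where the lookbehind has a fixed length. The class name, method name and the delimiter set [;,] are assumptions chosen for illustration:

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Splits at every position that is directly after or directly before a delimiter,
    // so the zero-width matches consume nothing and the delimiters survive as elements.
    static String[] splitKeepingDelimiters(String input) {
        return input.split("(?<=[;,])|(?=[;,])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingDelimiters("a;b,c")));
        // prints [a, ;, b, ,, c]
    }
}
```

This is compact, but it only stays simple as long as the delimiter has a fixed length; for patterns such as \$\w+ the Matcher loop above is the more general tool.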
梦该遗忘
#5 · 2019-01-01 10:12

I like the idea of StringTokenizer because it is an Enumeration.
But it is also obsolete, and has been replaced by String.split, which returns a boring String[] (and does not include the delimiters).

So I implemented a StringTokenizerEx, which is an Iterable and takes a true regexp to split a string.

A true regexp means the delimiter is not a 'character sequence' that may be repeated to form the delimiter:
'o' will only match a single 'o', and split 'ooo' into three delimiters, with two empty strings in between:

[o], '', [o], '', [o]

But the regexp o+ will return the expected result when splitting "aooob":

[], 'a', [ooo], 'b', []

To use this StringTokenizerEx:

final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
    // uses the split String detected and memorized in 'aString'
    final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}

The code of this class is available at DZone Snippets.

As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.


Note: (late 2009 edit)

The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior of String.split().
Josh Bloch even commented in response to that article:

Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).

Google's common library Guava also contains a Splitter, which is:

  • simpler to use
  • maintained by Google (and not by you)

So it may be worth checking out. From their initial rough documentation (pdf):

JDK has this:

String[] pieces = "foo.bar".split("\\.");

It's fine to use this if you want exactly what it does:

  • regular expression
  • result as an array
  • its way of handling empty pieces

Mini-puzzler: ",a,,b,".split(",") returns...

(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above

Answer: (e) None of the above.

",a,,b,".split(",")
returns
"", "a", "", "b"

Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)

In any case, our Splitter is simply more flexible: The default behavior is simplistic:

Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]

If you want extra features, ask for them!

Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]

Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
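
As for the workaround the mini-puzzler asks about: as far as I know, it is the two-argument String.split with a negative limit, which keeps trailing empty strings. A small sketch (the class and method names are made up for illustration):

```java
import java.util.Arrays;

class TrailingEmptiesDemo {
    // A negative limit tells split to apply the pattern as often as possible
    // and to keep trailing empty strings.
    static String[] splitKeepingTrailingEmpties(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        // Default behaviour (limit 0): trailing empty strings are dropped.
        System.out.println(Arrays.toString(",a,,b,".split(",")));
        // prints [, a, , b]

        System.out.println(Arrays.toString(splitKeepingTrailingEmpties(",a,,b,", ",")));
        // prints [, a, , b, ]
    }
}
```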

谁念西风独自凉
#6 · 2019-01-01 10:13

Pass true as the third argument (returnDelims). The tokenizer will then return the delimiters as tokens as well.

new StringTokenizer(str, delimiters, true);
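
A short sketch of how this might be used (the class name, helper method and sample input are made up for illustration). Keep in mind that StringTokenizer treats the delimiter argument as a set of single characters, not as a regex, so each delimiter character comes back as its own one-character token:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ReturnDelimsDemo {
    // Collects all tokens, including the delimiter characters themselves.
    static List<String> tokens(String text, String delimiterChars) {
        StringTokenizer tokenizer = new StringTokenizer(text, delimiterChars, true);
        List<String> result = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a;b,c", ";,")); // prints [a, ;, b, ,, c]
    }
}
```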
浪荡孟婆
#7 · 2019-01-01 10:13

I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):

static String[] splitWithDelimiters(String s) {
    if (s == null || s.length() == 0) {
        return new String[0];
    }
    LinkedList<String> result = new LinkedList<String>();
    StringBuilder sb = null;
    boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
    for (char c : s.toCharArray()) {
        if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
            if (sb != null) {
                result.add(sb.toString());
            }
            sb = new StringBuilder();
            wasLetterOrDigit = !wasLetterOrDigit;
        }
        sb.append(c);
    }
    result.add(sb.toString());
    return result.toArray(new String[0]);
}
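
The adaptation mentioned above (one token per delimiter character) could look roughly like the following regex-based sketch. It is not the code above but a swapped-in technique: it assumes that the Unicode classes \p{L} and \p{Nd} are an acceptable stand-in for Character.isLetterOrDigit, and the class name and the separateDelimiters flag are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparateDelimitersDemo {
    // A run of letters/digits, or a run of anything else (delimiters grouped).
    private static final Pattern GROUPED =
            Pattern.compile("[\\p{L}\\p{Nd}]+|[^\\p{L}\\p{Nd}]+");
    // A run of letters/digits, or exactly one other character (delimiters separate).
    private static final Pattern SEPARATE =
            Pattern.compile("[\\p{L}\\p{Nd}]+|[^\\p{L}\\p{Nd}]");

    static List<String> tokens(String s, boolean separateDelimiters) {
        List<String> result = new ArrayList<>();
        Matcher m = (separateDelimiters ? SEPARATE : GROUPED).matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Grouped:  "foo", ", ", "bar"
        System.out.println(String.join("|", tokens("foo, bar", false)));
        // Separate: "foo", ",", " ", "bar"
        System.out.println(String.join("|", tokens("foo, bar", true)));
    }
}
```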