How to split a string, but also keep the delimiter

Posted 2018-12-31 02:20

I have a multiline string which is delimited by a set of different delimiters:

(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)

I can split this string into its parts using String.split, but it seems that I can't get the actual string that matched the delimiter regex.

In other words, this is what I get:

  • Text1
  • Text2
  • Text3
  • Text4

This is what I want:

  • Text1
  • DelimiterA
  • Text2
  • DelimiterC
  • Text3
  • DelimiterB
  • Text4

Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?

Tags: java
23 answers
泛滥B
#2 · 2018-12-31 02:22

A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, along these lines (assuming a comma as the delimiter):

String marked = fullString.replace(",", "~,~");

Here you can replace the tilde (~) with any marker that is guaranteed not to occur in the text.

If you then split on that marker, I believe you will get the desired result.
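
A minimal Java sketch of this replace-then-split idea, for illustration only. The class SentinelSplit and the helper splitKeepingDelimiter are hypothetical names, and the sketch assumes the marker ("~" here) never occurs in the input; if it does, the result will contain spurious splits.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SentinelSplit {
    // Hypothetical helper: surrounds every literal delimiter with a marker,
    // then splits on the marker. Assumes the marker is absent from the input.
    static String[] splitKeepingDelimiter(String text, String delimiter, String marker) {
        String marked = text.replace(delimiter, marker + delimiter + marker);
        // Pattern.quote makes the marker a literal, even if it is a regex metacharacter.
        return marked.split(Pattern.quote(marker));
    }

    public static void main(String[] args) {
        // prints [a, ,, b, ,, c]
        System.out.println(Arrays.toString(splitKeepingDelimiter("a,b,c", ",", "~")));
    }
}
```

Note that adjacent delimiters leave an empty string between them: "a,,b" yields "a", ",", "", ",", "b".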

流年柔荑漫光年
#3 · 2018-12-31 02:22

I will post my working versions as well (the first is really similar to Markus's).

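import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Cuts the text after every match of regex, so each piece except the last
// ends with the delimiter that followed it.
public static String[] splitIncludeDelimeter(String regex, String text){
    List<String> list = new LinkedList<>();
    Matcher matcher = Pattern.compile(regex).matcher(text);

    int now, old = 0;
    while(matcher.find()){
        now = matcher.end();
        list.add(text.substring(old, now));
        old = now;
    }

    // no delimiter found: return the whole text as the only element
    if(list.isEmpty())
        return new String[]{text};

    // add the rest of the text as the last element
    String finalElement = text.substring(old);
    list.add(finalElement);

    return list.toArray(new String[0]);
}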

And here is a second solution, which is around 50% faster than the first one:

public static String[] splitIncludeDelimeter2(String regex, String text){
    List<String> list = new LinkedList<>();
    Matcher matcher = Pattern.compile(regex).matcher(text);

    StringBuffer stringBuffer = new StringBuffer();
    while(matcher.find()){
        // quoteReplacement keeps '$' and '\' in the matched delimiter literal
        matcher.appendReplacement(stringBuffer, Matcher.quoteReplacement(matcher.group()));
        list.add(stringBuffer.toString());
        stringBuffer.setLength(0); // clear buffer
    }

    matcher.appendTail(stringBuffer); // add the rest of the string
    list.add(stringBuffer.toString());

    return list.toArray(new String[0]);
}
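
Both methods above return each delimiter attached to the end of the text that precedes it. A variant that returns delimiters as separate tokens, as the question asks, might look like the sketch below. The class DelimiterTokens and the method tokenize are hypothetical names, and skipping empty text pieces is a design choice of this sketch, not something the answer specifies. Recent JDKs (21 and later, as far as I know) also provide String.splitWithDelimiters, which may make a hand-written loop unnecessary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterTokens {
    // Hypothetical helper: returns text pieces and matched delimiters in order,
    // each as its own token. Empty pieces and empty matches are skipped.
    static List<String> tokenize(String regex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int last = 0; // end of the previous match
        while (matcher.find()) {
            if (matcher.start() > last) {
                tokens.add(text.substring(last, matcher.start())); // text before the delimiter
            }
            if (matcher.end() > matcher.start()) {
                tokens.add(matcher.group()); // the delimiter itself
            }
            last = matcher.end();
        }
        if (last < text.length()) {
            tokens.add(text.substring(last)); // trailing text after the last delimiter
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Text1, ;, Text2, ,, Text3]
        System.out.println(tokenize("[;,]", "Text1;Text2,Text3"));
    }
}
```
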
忆尘夕之涩
#4 · 2018-12-31 02:23

I know this is a very old question and an answer has already been accepted, but I would still like to submit a very simple answer to the original question. Consider this code:

String str = "Hello-World:How\nAre You&doing";
String[] inputs = str.split("(?!^)\\b");
for (int i=0; i<inputs.length; i++) {
   System.out.println("a[" + i + "] = \"" + inputs[i] + '"');
}

OUTPUT:

a[0] = "Hello"
a[1] = "-"
a[2] = "World"
a[3] = ":"
a[4] = "How"
a[5] = "
"
a[6] = "Are"
a[7] = " "
a[8] = "You"
a[9] = "&"
a[10] = "doing"

I am just using the word boundary \b to delimit the words, except at the start of the text.
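
A short sketch of where the word-boundary approach does and does not fit. The class and method names are hypothetical. It relies on \b falling between a word character (letter, digit, or underscore) and a non-word character, so it only helps when the delimiters are non-word characters, and a run of adjacent delimiters comes back as one token.

```java
import java.util.Arrays;

class WordBoundarySplit {
    // Hypothetical wrapper around the regex from the answer above.
    static String[] splitOnWordBoundaries(String text) {
        return text.split("(?!^)\\b");
    }

    public static void main(String[] args) {
        // A comma followed by a space is returned as one token: a | ", " | b
        System.out.println(Arrays.toString(splitOnWordBoundaries("a, b")));
        // Digits are word characters, so there is no boundary inside "a1b"
        // and the string is not split at all.
        System.out.println(Arrays.toString(splitOnWordBoundaries("a1b")));
    }
}
```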

刘海飞了
#5 · 2018-12-31 02:26

I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):

static String[] splitWithDelimiters(String s) {
    if (s == null || s.length() == 0) {
        return new String[0];
    }
    LinkedList<String> result = new LinkedList<String>();
    StringBuilder sb = null;
    boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
    for (char c : s.toCharArray()) {
        if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
            if (sb != null) {
                result.add(sb.toString());
            }
            sb = new StringBuilder();
            wasLetterOrDigit = !wasLetterOrDigit;
        }
        sb.append(c);
    }
    result.add(sb.toString());
    return result.toArray(new String[0]);
}
回忆,回不去的记忆
#6 · 2018-12-31 02:29

You can use lookahead and lookbehind, like this:

System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));

And you will get:

[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]

The last one is what you want.

((?<=;)|(?=;)) matches the empty position (a zero-width match) just before or just after a ;.

Hope this helps.

EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regexes. One thing I do to ease this is to create a variable whose name describes what the regex does, and use Java's String.format to fill it in. Like this:

static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
...
public void someMethod() {
...
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
...

This helps a little bit. :-D
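
The same template also works when the delimiter is a small regex rather than a single character, which matches the question's "set of different delimiters". The sketch below is illustrative: LookaroundSplit and splitKeeping are hypothetical names, and it assumes single-character delimiters. Lookbehind in java.util.regex generally needs a pattern of bounded length, so an unbounded delimiter regex such as o+ is not a safe argument here.

```java
import java.util.Arrays;

class LookaroundSplit {
    // Same template as in the answer: split before or after the delimiter.
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Hypothetical helper; delimiterRegex is assumed to match exactly one character.
    static String[] splitKeeping(String text, String delimiterRegex) {
        return text.split(String.format(WITH_DELIMITER, delimiterRegex));
    }

    public static void main(String[] args) {
        // prints [Text1, ;, Text2, ,, Text3]
        System.out.println(Arrays.toString(splitKeeping("Text1;Text2,Text3", "[;,]")));
    }
}
```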

柔情千种
#7 · 2018-12-31 02:29

I like the idea of StringTokenizer because it is an Enumeration.
But it is also obsolete, and has been replaced by String.split, which returns a boring String[] (and does not include the delimiters).

So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.

A true regexp means the pattern is not treated as a 'character sequence' that may be repeated to form the delimiter:
'o' will only match 'o', and will split 'ooo' into three delimiters, with two empty strings between them:

[o], '', [o], '', [o]

But the regexp o+ will return the expected result when splitting "aooob":

[], 'a', [ooo], 'b', []

To use this StringTokenizerEx:

final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
    // uses the split String detected and memorized in 'aString'
    final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}

The code of this class is available at DZone Snippets.

As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.


Note: (late 2009 edit)

The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior of String.split().
Josh Bloch even commented in response to that article:

Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).

The Google common library Guava also contains a Splitter, which is:

  • simpler to use
  • maintained by Google (and not by you)

So it may be worth checking out. From their initial rough documentation (pdf):

JDK has this:

String[] pieces = "foo.bar".split("\\.");

It's fine to use this if you want exactly what it does:

  • regular expression
  • result as an array
  • its way of handling empty pieces

Mini-puzzler: ",a,,b,".split(",") returns...

(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above

Answer: (e) None of the above.

",a,,b,".split(",")
returns
"", "a", "", "b"

Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)

In any case, our Splitter is simply more flexible. The default behavior is simplistic:

Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]

If you want extra features, ask for them!

Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]

Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
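
On the mini-puzzler quoted above: with the JDK alone, one known way to keep the trailing empty strings is to pass a negative limit to String.split. The sketch below only illustrates that behavior; the class and method names are hypothetical.

```java
import java.util.Arrays;

class TrailingEmpties {
    // Default split: trailing empty strings are dropped.
    static String[] dropTrailing(String text) {
        return text.split(",");
    }

    // A negative limit applies the pattern as often as possible and keeps
    // trailing empty strings.
    static String[] keepAll(String text) {
        return text.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(dropTrailing(",a,,b,").length); // prints 4
        System.out.println(keepAll(",a,,b,").length);      // prints 5
        System.out.println(Arrays.toString(keepAll(",a,,b,")));
    }
}
```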
