I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts using String.split, but it seems that I can't get the actual string that matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want:
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, wrapping each occurrence in a marker (assuming a comma as the delimiter, replace "," with "~,~").
You can replace the tilde (~) with any appropriate unique delimiter.
If you then split on your new delimiter, I believe you will get the desired result.
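A minimal sketch of the replace-then-split idea described above. The class and method names are hypothetical, and it assumes the marker (here "~") never occurs in the input:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class MarkerSplitDemo {
    // Hypothetical helper: surrounds every literal delimiter with a marker,
    // then splits on the marker so the delimiter survives as its own token.
    // Assumption: the marker does not appear anywhere in the input.
    static String[] splitKeepingDelimiter(String input, String delimiter, String marker) {
        String marked = input.replace(delimiter, marker + delimiter + marker);
        // Pattern.quote keeps the marker literal, since String.split takes a regex.
        return marked.split(Pattern.quote(marker));
    }

    public static void main(String[] args) {
        // prints [a, ,, b, ,, c]
        System.out.println(Arrays.toString(splitKeepingDelimiter("a,b,c", ",", "~")));
    }
}
```

This only handles a literal delimiter, not a delimiter regex, which is why it counts as the naive option.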
I will also post my working versions (the first is really similar to Markus's).
And here is the second solution, which is around 50% faster than the first one:
I know this is a very, very old question and an answer has already been accepted, but I would still like to submit a very simple answer to the original question. Consider this code:
OUTPUT:
I am just using the word boundary \b to delimit the words, except when it is at the start of the text.
I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):
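The poster's implementation is not reproduced here. As a rough sketch of how such a hand-written splitter could look, the hypothetical class below walks the input with java.util.regex.Matcher and emits both the text between matches and the matched delimiters. Whether a run of delimiters comes back as one token depends on the regex the caller passes (for example [,;]+ versus [,;]), which is an assumption of this sketch rather than the original design:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterKeepingSplitter {
    // Hypothetical helper: returns text tokens and delimiter tokens in input order.
    static List<String> split(String input, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(delimiterRegex).matcher(input);
        int last = 0; // index just past the previous delimiter
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // ignore empty matches; find() advances by itself
            }
            if (m.start() > last) {
                tokens.add(input.substring(last, m.start())); // text before the delimiter
            }
            tokens.add(m.group()); // the delimiter exactly as it matched
            last = m.end();
        }
        if (last < input.length()) {
            tokens.add(input.substring(last)); // trailing text after the last delimiter
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [a, ,, b, ;;, c]
        System.out.println(split("a,b;;c", "[,;]+"));
    }
}
```

Unlike String.split, this version skips empty text tokens between adjacent delimiters; keeping them would need a small change to the start/last comparison.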
You can use lookahead and lookbehind, like this:
And you will get:
The last one is what you want.
((?<=;)|(?=;)) means: match the empty string before ; or after ;.
Hope this helps.
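To make the three lookaround variants concrete, here is a small sketch using only String.split. The class name, constant names and the sample input "a;b;c;d" are illustrative assumptions:

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    static final String AFTER_DELIMITER = "(?<=;)";          // split just after each ;
    static final String BEFORE_DELIMITER = "(?=;)";          // split just before each ;
    static final String AROUND_DELIMITER = "((?<=;)|(?=;))"; // split on both sides of each ;

    public static void main(String[] args) {
        String input = "a;b;c;d";
        // Delimiter stays attached to the preceding text: [a;, b;, c;, d]
        System.out.println(Arrays.toString(input.split(AFTER_DELIMITER)));
        // Delimiter stays attached to the following text: [a, ;b, ;c, ;d]
        System.out.println(Arrays.toString(input.split(BEFORE_DELIMITER)));
        // Delimiter becomes its own element: [a, ;, b, ;, c, ;, d]
        System.out.println(Arrays.toString(input.split(AROUND_DELIMITER)));
    }
}
```

Because lookarounds match a position rather than characters, nothing is consumed by the split, so the delimiters remain in the result.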
EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regex. One thing I do to ease this is to create a variable whose name represents what the regex does, and use Java's String.format to fill it in. Like this:
This helps a little bit. :-D
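A sketch of the named-template idea, assuming a constant called WITH_DELIMITER and a helper called splitKeeping (both names are illustrative):

```java
import java.util.Arrays;

class WithDelimiterDemo {
    // The name says what the regex does; %1$s is filled in with the delimiter regex.
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Hypothetical helper. The delimiter regex ends up inside a lookbehind,
    // so patterns of unbounded length (such as o+) may be rejected by java.util.regex.
    static String[] splitKeeping(String input, String delimiterRegex) {
        return input.split(String.format(WITH_DELIMITER, delimiterRegex));
    }

    public static void main(String[] args) {
        // prints [a, ;, b, ;, c, ;, d]
        System.out.println(Arrays.toString(splitKeeping("a;b;c;d", ";")));
        // prints [a, ,, b, ;, c]
        System.out.println(Arrays.toString(splitKeeping("a,b;c", "[,;]")));
    }
}
```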
I like the idea of StringTokenizer because it is an Enumeration.
But it is also obsolete, and has been replaced by String.split, which returns a boring String[] (and does not include the delimiters).
So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.
A true regexp means it is not a 'Character sequence' repeated to form the delimiter:
'o' will only match 'o', and will split 'ooo' into three delimiters, with two empty strings between them:
But the regexp o+ will return the expected result when splitting "aooob":
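The same regex semantics can be observed with plain String.split, which is not StringTokenizerEx and drops the delimiters, but shows the two empty strings that appear with o and disappear with o+. The class and method names are hypothetical:

```java
import java.util.Arrays;

class RepeatedDelimiterDemo {
    // Hypothetical helper: splits the fixed sample "aooob" on the given regex.
    static String[] splitSample(String regex) {
        return "aooob".split(regex);
    }

    public static void main(String[] args) {
        // Each o is its own delimiter, leaving two empty strings: [a, , , b]
        System.out.println(Arrays.toString(splitSample("o")));
        // The whole run ooo is one delimiter: [a, b]
        System.out.println(Arrays.toString(splitSample("o+")));
    }
}
```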
To use this StringTokenizerEx:
The code of this class is available at DZone Snippets.
As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.
Note: (late 2009 edit)
The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior of String.split().
Josh Bloch even commented in response to that article:
Google's common library Guava also contains a Splitter, which is:
So it may be worth checking out. From their initial rough documentation (pdf):