I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts using String.split, but it seems that I can't get the actual string that matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want:
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
Here is a simple, clean implementation that is consistent with Pattern#split and works with variable-length patterns, which lookbehind cannot support, and it is easier to use. It is similar to the solution provided by @cletus.
I don't do null checks here; Pattern#split doesn't, so why should I? I don't like the if at the end, but it is required for consistency with Pattern#split. Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.
I convert to String[] for consistency with Pattern#split. I use new String[0] rather than new String[result.size()], see here for why.
Here are my tests:
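As a minimal sketch of the approach this answer describes, and of the kind of results its tests might check: the class name DelimiterSplitter and the overload taking a regex string are hypothetical names chosen for illustration, not anything from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this sketch.
final class DelimiterSplitter {
    private DelimiterSplitter() {}

    // Convenience overload: compiles the regex, then delegates.
    static String[] split(CharSequence input, String regex) {
        return split(input, Pattern.compile(regex));
    }

    // Returns the text between matches AND the matched delimiters, in order.
    // No null checks, mirroring Pattern#split.
    static String[] split(CharSequence input, Pattern pattern) {
        Matcher matcher = pattern.matcher(input);
        List<String> result = new ArrayList<>();
        int start = 0; // index just past the previous match
        while (matcher.find()) {
            result.add(input.subSequence(start, matcher.start()).toString());
            result.add(matcher.group());
            start = matcher.end();
        }
        // Append the tail only if something is left, so that input ending
        // with a delimiter does not produce a trailing empty string.
        if (start != input.length()) {
            result.add(input.subSequence(start, input.length()).toString());
        }
        return result.toArray(new String[0]);
    }
}
```

With this sketch, DelimiterSplitter.split("a1b22c", "\\d+") gives a, 1, b, 22, c, and DelimiterSplitter.split("a1", "\\d") gives a, 1 with no empty last element. Because it only uses Matcher#find, the delimiter pattern can have any length.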
If you can afford it, use Java's replace(CharSequence target, CharSequence replacement) method to put another delimiter in front of the real one, and split on that instead. Example: I want to split the string "boo:and:foo" and keep each ':' attached to the string on its right-hand side.
Important note: This only works if the replacement delimiter ("newdelimiter" in this example) does not otherwise occur in your String! Thus, it is not a general solution. But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.
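A minimal sketch of that idea, assuming the marker text "newdelimiter" never occurs in the input. The class and method names are hypothetical, chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this sketch.
final class ReplaceThenSplit {
    private ReplaceThenSplit() {}

    // ASSUMPTION: this marker never appears in the input string.
    private static final String MARKER = "newdelimiter";

    // Puts the marker in front of every ':' and then splits on the marker,
    // so each ':' stays at the start of the piece to its right.
    static String[] splitKeepingColonOnRight(String input) {
        String marked = input.replace(":", MARKER + ":");
        // Pattern.quote guards against regex metacharacters in the marker.
        return marked.split(Pattern.quote(MARKER));
    }
}
```

For "boo:and:foo" this gives boo, :and, :foo.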
You want to use lookarounds, and split on zero-width matches. Here are some examples:
And yes, that is a triply-nested assertion there in the last pattern.
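As a sketch of the basic technique only, here are three simple patterns for a single ';' delimiter. These are illustrative assumptions and much simpler than the nested pattern mentioned above; the class and method names are hypothetical.

```java
// Hypothetical helper class; the names are assumptions made for this sketch.
final class LookaroundSplit {
    private LookaroundSplit() {}

    // Lookbehind: split at the empty position just AFTER each ';',
    // so the delimiter stays at the end of the piece to its left.
    static String[] delimiterOnLeft(String input) {
        return input.split("(?<=;)");
    }

    // Lookahead: split at the empty position just BEFORE each ';',
    // so the delimiter stays at the start of the piece to its right.
    static String[] delimiterOnRight(String input) {
        return input.split("(?=;)");
    }

    // Both: split before and after each ';',
    // so the delimiter becomes an element of its own.
    static String[] delimiterSeparate(String input) {
        return input.split("(?<=;)|(?=;)");
    }
}
```

For "a;b;c;d" these give a;, b;, c;, d and a, ;b, ;c, ;d and a, ;, b, ;, c, ;, d respectively. The splits are on zero-width matches, so no characters are consumed. Note that Java lookbehind needs a bounded-length pattern, which is the limitation other answers here refer to.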
Another candidate solution using a regex. Retains token order, correctly matches multiple tokens of the same type in a row. The downside is that the regex is kind of nasty.
Sample output:
Are you worried about the complications look-ahead/look-behind stuff might introduce, and do you just want a rock-solid utility method that can cope with any token pattern and any separators you throw at it? (Which is probably the case!)
NB: surprised to find that the Apache Commons people don't seem to have provided this, e.g. in StringUtils.
Also, I suggest that this should be a flag in Pattern, i.e. INCLUDE_SEPARATORS.
But this is pretty simple if you use the Pattern and Matcher classes right:
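One way that might look, as a sketch rather than a definitive utility: the class SeparatorTokenizer, the method tokenize, and the boolean includeSeparators are hypothetical names. The boolean stands in for the INCLUDE_SEPARATORS flag suggested above, which does not exist in Pattern. Skipping empty text between adjacent separators is a design choice of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical utility class; all names are assumptions made for this sketch.
final class SeparatorTokenizer {
    private SeparatorTokenizer() {}

    // Walks the input with Matcher#find and collects the text between
    // separator matches, plus the separators themselves when requested.
    static List<String> tokenize(String input, Pattern separator, boolean includeSeparators) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = separator.matcher(input);
        int textStart = 0; // index just past the previous separator
        while (matcher.find()) {
            // Skip empty text, e.g. between two adjacent separators.
            if (matcher.start() > textStart) {
                tokens.add(input.substring(textStart, matcher.start()));
            }
            // Ignore zero-width matches so no empty separators are emitted.
            if (includeSeparators && matcher.end() > matcher.start()) {
                tokens.add(matcher.group());
            }
            textStart = matcher.end();
        }
        if (textStart < input.length()) {
            tokens.add(input.substring(textStart)); // trailing text, if any
        }
        return tokens;
    }
}
```

With the string from the question and Pattern.compile("Delimiter[ABC]"), passing true gives Text1, DelimiterA, Text2, DelimiterC, Text3, DelimiterB, Text4, and passing false gives just Text1, Text2, Text3, Text4.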