I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts using String.split, but it seems that I can't get the actual string that matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want:
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
I don't think it is possible with String#split, but you can use a StringTokenizer, though that won't allow you to define your delimiter as a regex, only as a set of single characters:
Here's a Groovy version based on some of the code above, in case it helps. It's short, anyway. It conditionally includes the head and tail (if they are not empty). The last part is a demo/test case.
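
Going back to the StringTokenizer suggestion, a minimal Java sketch (not the Groovy code referred to above) might look like the following. The class and method names are hypothetical; the only library feature relied on is the three-argument StringTokenizer constructor, whose returnDelims flag makes the tokenizer hand back each delimiter character as its own token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerSplit {
    // Hypothetical helper: splits on any single character contained in
    // delimChars and keeps every delimiter as a separate token.
    static List<String> splitKeepingDelimiters(String input, String delimChars) {
        // returnDelims = true tells the tokenizer to return delimiters too.
        StringTokenizer tokenizer = new StringTokenizer(input, delimChars, true);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each character of ";," is treated as a delimiter of its own.
        for (String token : splitKeepingDelimiters("Text1;Text2,Text3", ";,")) {
            System.out.println(token);
        }
    }
}
```

Because every character in the delimiter string counts separately, this only fits cases where the delimiters really are single characters.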
You can use Lookahead and Lookbehind. Like this:
And you will get:
The last one is what you want.
((?<=;)|(?=;)) matches the empty string just before ; or just after ;.
Hope this helps.
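
As a sketch of the lookahead/lookbehind idea described above, the snippet below splits the same sample string three ways. The class name, constant names, and the sample input are assumptions made for illustration.

```java
import java.util.Arrays;

class LookaroundSplit {
    // Zero-width position right after each ";" (delimiter stays with the left part).
    static final String AFTER = "(?<=;)";
    // Zero-width position right before each ";" (delimiter stays with the right part).
    static final String BEFORE = "(?=;)";
    // Both positions, so each ";" becomes a token of its own.
    static final String BOTH = "((?<=;)|(?=;))";

    public static void main(String[] args) {
        String input = "a;b;c;d";
        System.out.println(Arrays.toString(input.split(AFTER)));  // [a;, b;, c;, d]
        System.out.println(Arrays.toString(input.split(BEFORE))); // [a, ;b, ;c, ;d]
        System.out.println(Arrays.toString(input.split(BOTH)));   // [a, ;, b, ;, c, ;, d]
    }
}
```

None of the patterns consumes any characters, which is why the delimiters survive the split.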
EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regexes. One thing I do to ease this is to create a variable whose name represents what the regex does, and use Java's String.format to fill in the delimiter. Like this:
This helps a little bit. :-D
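
A minimal sketch of the named-template idea might look like this. The constant name WITH_DELIMITER and the helper method are hypothetical; the delimiter passed in is assumed to be a regex that Java accepts inside a lookbehind (that is, one with a bounded match length).

```java
import java.util.Arrays;

class DelimiterFormat {
    // %1$s is filled in twice with the same delimiter regex:
    // once in the lookbehind and once in the lookahead.
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Hypothetical helper that builds the pattern and splits with it.
    static String[] splitWithDelimiter(String input, String delimiterRegex) {
        return input.split(String.format(WITH_DELIMITER, delimiterRegex));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWithDelimiter("a;b;c;d", ";")));
        // prints [a, ;, b, ;, c, ;, d]
    }
}
```

The call site then reads as "split with delimiter" rather than as a raw lookaround expression.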
I got here late, but returning to the original question, why not just use lookarounds?
output:
EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:
I hope that's easier to read. Thanks for the heads-up, @finnw.
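
To tie the lookaround approach back to the original question, here is a sketch that handles multi-character delimiters. The delimiter regex Delimiter[ABC] and the sample input are assumptions standing in for the real delimiters, and each token is printed on its own line to avoid the comma ambiguity mentioned above.

```java
class MultiCharLookaroundSplit {
    // Hypothetical delimiter regex: DelimiterA, DelimiterB or DelimiterC.
    // It has a fixed length, so Java accepts it inside a lookbehind.
    static final String DELIMITER = "Delimiter[ABC]";

    // Split at the zero-width positions just after or just before a delimiter.
    static final String SPLIT_REGEX = "(?<=" + DELIMITER + ")|(?=" + DELIMITER + ")";

    static String[] splitKeepingDelimiters(String input) {
        return input.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        String input = "Text1DelimiterAText2DelimiterCText3DelimiterBText4";
        // One token per line, so no separator characters are added to the output.
        for (String token : splitKeepingDelimiters(input)) {
            System.out.println(token);
        }
    }
}
```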
You may be worried about the complications that look-ahead/look-behind might introduce, and just want a rock-solid utility method that can cope with any token pattern and any separators you throw at it (which is probably the case!).
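
One way such a utility method could be sketched with Pattern and Matcher is shown below. The class and method names are hypothetical, and the choice to drop empty tokens (between adjacent separators, or at the very start or end) is an assumption; adjust it if empty tokens matter to you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorSplitter {
    // Returns the tokens and the separators of input, in their original order.
    static List<String> splitIncludingSeparators(String input, Pattern separator) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = separator.matcher(input);
        int tokenStart = 0;
        while (matcher.find()) {
            if (matcher.start() == matcher.end()) {
                continue; // ignore empty separator matches
            }
            if (matcher.start() > tokenStart) {
                // Text between the previous separator and this one.
                parts.add(input.substring(tokenStart, matcher.start()));
            }
            parts.add(matcher.group()); // the separator exactly as matched
            tokenStart = matcher.end();
        }
        if (tokenStart < input.length()) {
            parts.add(input.substring(tokenStart)); // trailing token
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern separator = Pattern.compile("--|::");
        for (String part : splitIncludingSeparators("Text1--Text2::Text3", separator)) {
            System.out.println(part);
        }
    }
}
```

Since the separator is matched directly rather than through lookarounds, it can be any regex, including one of unbounded length.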
NB: I'm surprised to find that the Apache Commons people don't seem to have provided this, e.g. in StringUtils. Also, I suggest that this should be a flag in Pattern, i.e. INCLUDE_SEPARATORS.
But this is pretty simple if you use the Pattern and Matcher classes right:
A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, along the lines of (assuming comma for the delimiter):
Where you can replace the tilde (~) with an appropriate unique delimiter.
Then if you do a split on your new delimiter, I believe you will get the desired result.
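
A minimal sketch of this replace-then-split idea, assuming a comma delimiter and a tilde marker that never occurs in the input (the class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class NaiveMarkerSplit {
    // Surrounds every delimiter with the marker, then splits on the marker.
    // Assumption: the marker does not appear anywhere in the input.
    static String[] splitKeepingDelimiter(String input, String delimiter, String marker) {
        String marked = input.replace(delimiter, marker + delimiter + marker);
        // Pattern.quote makes split treat the marker as literal text, not a regex.
        return marked.split(Pattern.quote(marker));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingDelimiter("a,b,c", ",", "~")));
        // prints [a, ,, b, ,, c]
    }
}
```

Note that two adjacent delimiters produce an empty string between them, because the two markers end up next to each other.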