Regex pattern for split

Posted 2019-01-29 14:41

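I would like to split a string on commas, subject to these rules:

  • , (comma): separates terms
  • " (double quote): delimits a string value; special characters inside it are ignored
  • [] (square brackets): delimit an array; commas inside it do not split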

For instance:

input : a=1,b="1,2,3",c=[d=1,e="1,2,3"]

expected output:

    a=1
    b="1,2,3"
    c=[d=1,e="1,2,3"]

But I could not get the above result.

I have written the code below:

 String line = "a=1,b=\"1,2,3\",c=[d=1,e=\"1,11\"]";
 String[] tokens = line.split(",(?=(([^\"]*\"){2})*[^\"]*$)");
 for (String t : tokens)
     System.out.println(t);

and my output is:

a=1
b="1,2,3"
c=[d=1
e="1,11"]

What do I need to change to get the expected output? Should I stick to a regular expression or might another solution be more flexible and easier to maintain?

2 answers
劫难
#2 · 2019-01-29 14:57

This regex does the trick:

",(?=(([^\"]*\"){2})*[^\"]*$)(?=([^\\[]*?\\[[^\\]]*\\][^\\[\\]]*?)*$)"

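It works by adding a second look-ahead that requires every square bracket after the comma to belong to a balanced pair. A comma inside a bracketed term is followed by a closing bracket with no opening bracket before it, so the look-ahead fails there and that comma is not used as a split point.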

Here's some test code:

String line = "a=1,b=\"1,2,3\",c=[d=1,e=\"1,11\"]";
String[] tokens = line.split(",(?=(([^\"]*\"){2})*[^\"]*$)(?=([^\\[]*?\\[[^\\]]*\\][^\\[\\]]*?)*$)");
for (String t : tokens)
    System.out.println(t);

Output:

a=1
b="1,2,3"
c=[d=1,e="1,11"]
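
Note that the bracket look-ahead assumes brackets are never nested. If nesting is possible, a small hand-written scanner (the "other solution" the question asks about) may be easier to maintain. The following is only a minimal sketch under stated assumptions: the class name TopLevelSplitter and its split method are hypothetical, quotes are assumed to contain no escaped quotes, and brackets are assumed to be balanced.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelSplitter {
    /** Splits on commas that are outside double quotes and outside [ ] pairs. */
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        boolean inQuotes = false; // true while between two double quotes
        int depth = 0;            // current [ ] nesting depth (assumes balanced input)
        int start = 0;            // index where the current token begins
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes) {
                if (c == '[') {
                    depth++;
                } else if (c == ']') {
                    depth--;
                } else if (c == ',' && depth == 0) {
                    // Top-level comma: close the current token
                    tokens.add(line.substring(start, i));
                    start = i + 1;
                }
            }
        }
        tokens.add(line.substring(start)); // last token has no trailing comma
        return tokens;
    }

    public static void main(String[] args) {
        // Includes a nested array to show that depth tracking handles it
        for (String t : split("a=1,b=\"1,2,3\",c=[d=1,e=[2,3]]")) {
            System.out.println(t);
        }
    }
}
```

Because the scanner counts depth instead of matching pairs with a pattern, supporting another delimiter pair only needs one more counter or an extra case in the branch.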
Lonely孤独者°
#3 · 2019-01-29 15:14

I know the question is nearly a year old, but... this regex is much simpler:

\[[^]]*\]|"[^"]*"|(,)
  • The leftmost branch of the | matches complete [bracketed groups]
  • The middle branch matches "strings like this"
  • The rightmost branch captures commas to Group 1, and we know they are the right commas because they weren't matched by the branches on the left
  • All we need to do is split on Group 1

Splitting on Group 1 Captures

You can do it like this:

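import java.util.regex.Matcher;
import java.util.regex.Pattern;

String subject = "a=1,b=\"1,2,3\",c=[d=1,e=\"1,11\"]";
Pattern regex = Pattern.compile("\\[[^]]*\\]|\"[^\"]*\"|(,)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
    // Group 1 is set only for a comma outside brackets and quotes
    if (m.group(1) != null) m.appendReplacement(b, "@@SplitHere@@");
    // Otherwise copy the match back unchanged; quote it so $ and \ stay literal
    else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("@@SplitHere@@");
for (String split : splits) System.out.println(split);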

This is a two-step split: first, we replace the commas we want with something distinctive, such as @@SplitHere@@; then we split the resulting string on that marker.
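
If the placeholder is a concern (for example, because the input might itself contain @@SplitHere@@), the same pattern can in principle drive a single pass that cuts the string at the offsets of the Group 1 commas. This is a sketch rather than the method described above; the class name OnePassSplit and its split method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OnePassSplit {
    // Same three branches: [brackets] | "strings" | (comma captured in Group 1)
    private static final Pattern TOKEN =
            Pattern.compile("\\[[^\\]]*\\]|\"[^\"]*\"|(,)");

    static List<String> split(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(subject);
        int start = 0; // index where the current part begins
        while (m.find()) {
            // Only a Group 1 match is a real separator; other matches are skipped over
            if (m.group(1) != null) {
                parts.add(subject.substring(start, m.start()));
                start = m.end();
            }
        }
        parts.add(subject.substring(start)); // text after the last separator
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("a=1,b=\"1,2,3\",c=[d=1,e=\"1,11\"]")) {
            System.out.println(part);
        }
    }
}
```

The bracket and string branches still consume their text, so commas inside them are never offered to Group 1; the only change is that separators are recorded by position instead of being rewritten.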

Pros and Cons

  • The main benefit of this technique is that it is extremely easy to understand and maintain. If you later decide to exclude commas {inside , curlies} as well, you just add another OR branch on the left side of the regex: \{[^{}]*\} (Java rejects an unescaped { outside a character class, hence the backslashes)
  • Once you are familiar with it, you can use it in many contexts
  • In this case, the main drawback is that we proceed in two steps, replacing before splitting. In my view, with modern processors that's irrelevant. Maintainable code is much more important.
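
To illustrate the first point, here is a hedged sketch of the pattern with a curly-brace branch added. It assumes Java 9 or later, where Matcher.replaceAll accepts a function; the class name CurlyAwareSplit, the MARKER constant, and the sample input with b={x,y} are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CurlyAwareSplit {
    static final String MARKER = "@@SplitHere@@";

    // Branches: [brackets] | {curlies} | "strings" | (comma captured in Group 1)
    static final Pattern TOKEN =
            Pattern.compile("\\[[^\\]]*\\]|\\{[^{}]*\\}|\"[^\"]*\"|(,)");

    static String[] split(String subject) {
        // Step 1: replace Group 1 commas with the marker and keep every other
        // match as-is (quoted, so that $ and \ in the input stay literal)
        String replaced = TOKEN.matcher(subject).replaceAll(
                r -> r.group(1) != null ? MARKER : Matcher.quoteReplacement(r.group()));
        // Step 2: split on the marker
        return replaced.split(MARKER);
    }

    public static void main(String[] args) {
        for (String s : split("a=1,b={x,y},c=[d=1,e=\"1,11\"]")) {
            System.out.println(s);
        }
    }
}
```

Only the pattern changed; the replace-then-split logic is the same as before, which is the maintainability argument in a nutshell.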

Reference

This technique has many applications. It is fully explained in these two links.
