Is it possible to create a regular expression with a variable number of groups?
After running this for instance...
Pattern p = Pattern.compile("ab([cd])*ef");
Matcher m = p.matcher("abcddcef");
m.matches();
... I would like to have something like
m.group(1) = "c"
m.group(2) = "d"
m.group(3) = "d"
m.group(4) = "c"
(Background: I'm parsing some lines of data, and one of the "fields" is repeating. I would like to avoid a matcher.find loop for these fields.)
As pointed out by @Tim Pietzcker in the comments, Perl 6 and .NET have this feature.
According to the documentation, Java regular expressions can't do this:
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
(emphasis added)
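To make the quoted rule concrete, here is a minimal sketch that runs the pattern from the question and the example from the documentation. The class name LastCaptureDemo and the helper groupAfterMatch are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Hypothetical helper: returns the given group after a full match,
    // or null if the input does not match (or the group did not participate).
    static String groupAfterMatch(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // ([cd]) is evaluated four times, but only the last iteration is kept.
        System.out.println(groupAfterMatch("ab([cd])*ef", "abcddcef", 1)); // prints c

        // The example from the documentation: group two stays set to "b".
        System.out.println(groupAfterMatch("(a(b)?)+", "aba", 2)); // prints b
    }
}
```

The earlier iterations ("c", "d", "d") are overwritten, so there is nothing left for m.group(2) through m.group(4) to return.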
Pattern p = Pattern.compile("ab(?:(c)|(d))*ef");
Matcher m = p.matcher("abcdef");
m.matches();
should do what you want.
EDIT:
@aioobe, I understand now. You want to be able to do something like the grammar
A    ::== <Foo> <Bars> <Baz>
Foo  ::== "foo"
Baz  ::== "baz"
Bars ::== <Bar> <Bars>
        | ε
Bar  ::== "A"
        | "B"
and pull out all the individual matches of Bar.
No, there is no way to do that using java.util.regex. You can recurse and use a regex on the match of Bars, or use a parser generator like ANTLR and attach a side effect to Bar.
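The "recurse and use a regex on the match of Bars" suggestion could look roughly like the sketch below. It assumes the grammar above maps onto the regex foo((?:A|B)*)baz; the class name TwoStageMatch and the method bars are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageMatch {
    // Stage 1: validate the whole input and capture the entire Bars section once.
    private static final Pattern OUTER = Pattern.compile("foo((?:A|B)*)baz");
    // Stage 2: a second regex that matches one Bar at a time.
    private static final Pattern BAR = Pattern.compile("A|B");

    // Returns every individual Bar, or an empty list if the input does not match.
    static List<String> bars(String input) {
        List<String> result = new ArrayList<>();
        Matcher outer = OUTER.matcher(input);
        if (!outer.matches()) {
            return result;
        }
        Matcher bar = BAR.matcher(outer.group(1));
        while (bar.find()) {
            result.add(bar.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bars("fooABBAbaz")); // prints [A, B, B, A]
    }
}
```

This still uses a find loop, but only over the substring that the outer match has already validated.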
You can use split to get the fields you need into an array and loop through that.
http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html#split(java.lang.String)
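As a rough sketch of the split approach: the question does not show the real line format, so the comma-separated layout "ab:c,d,d,c:ef" below is purely an assumption, as are the class name SplitFields and the method fields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitFields {
    // Assumed line format: "ab:" + comma-separated c/d fields + ":ef".
    private static final Pattern LINE = Pattern.compile("ab:([cd](?:,[cd])*):ef");

    // Captures the repeating section as one group, then splits it into fields.
    static List<String> fields(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        return Arrays.asList(m.group(1).split(","));
    }

    public static void main(String[] args) {
        System.out.println(fields("ab:c,d,d,c:ef")); // prints [c, d, d, c]
    }
}
```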
I have not used Java regex, but for many languages the answer is: no.
Capturing groups seem to be created when the regex is parsed, and filled when it matches the string. The expression (a)|(b)(c) has three capturing groups, even though any given match fills only one or two of them. (a)* has just one group; the engine leaves the last match in that group after matching.
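For Java specifically, this can be checked with Matcher.groupCount(), which depends only on the pattern. A small sketch (the class name GroupCountDemo and both helpers are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups in the pattern; no input is needed to know it.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Groups 1..n after a full match; groups that did not participate are null.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a)|(b)(c)"));                            // prints 3
        System.out.println(String.join(",", groups("(a)|(b)(c)", "a")));        // prints a,null,null
        System.out.println(groupCount("(a)*"));                                  // prints 1
        System.out.println(groups("(a)*", "aaa")[0]);                            // prints a
    }
}
```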
I would think that backtracking inhibits this behavior; consider the effect of /([\S\s])/ accumulating its groups on something like the Bible. Even if it could be done, the output would be unknowable, as the groups would lose their positional meaning. It's better to run a separate regex for each kind of item as a global match and have the results deposited into an array.
I have just had a very similar problem, and managed to do "variable number of groups" by a combination of a while loop and resetting the matcher.
int i = 0;
String m1 = null, m2 = null;
// find(i) resets the matcher and searches again from index i
while (matcher.find(i) && (m1 = matcher.group(1)) != null && (m2 = matcher.group(2)) != null)
{
    // do work on two found groups
    i = matcher.end();
}
But this is for my problem (with two repeating
// Match a single c or d, but only when it sits between "ab" and "ef".
Pattern pattern = Pattern.compile("(?<=^ab[cd]{0,100})[cd](?=[cd]{0,100}ef$)");
Matcher matcher = pattern.matcher("abcddcef");
int i = 0;
String res = null;
while (matcher.find(i) && (res = matcher.group()) != null)
{
    System.out.println(res);
    i = matcher.end();
}
You lose the ability to specify an arbitrary length of repetition with * or +, because the look-behind must have an obvious maximum length (hence the {0,100} bounds). The look-ahead is not subject to that restriction.