The code
String s = "y z a a a b c c z";
Pattern p = Pattern.compile("(a )+(b )+(c *)c");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}
prints
a a a b c c
which is right.
But logically, the substrings
a a a b c
a a b c c
a a b c
a b c c
a b c
match the regex too.
So, how can I make the code find those substrings too, i.e. not only the most extended one, but also its children?
The only way I can think of here would be to generate a list of all possible substrings of your original string and match the regex against each of them, retaining those items where it matched.
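A minimal sketch of that brute-force idea follows. The class name `AllSubstringMatches` and the helper `findAll` are made up for illustration; only `Pattern`, `Matcher.matches()` and `String.substring` are standard API. It enumerates every non-empty substring and keeps those the regex matches in full.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class AllSubstringMatches {

    // Hypothetical helper: returns every non-empty substring of input
    // that the regex matches in its entirety, ordered by start, then end.
    static List<String> findAll(String input, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                String candidate = input.substring(start, end);
                // matches() requires the whole candidate to match, unlike find()
                if (p.matcher(candidate).matches()) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String match : findAll("y z a a a b c c z", "(a )+(b )+(c *)c")) {
            System.out.println(match);
        }
    }
}
```

This tests O(n²) candidates, so it is only reasonable for short inputs. Note that with the pattern exactly as written in the question, only substrings ending in "c c" are reported, because `(c *)c` needs two c's.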
Given these very specific constraints (i.e. this is not a general case solution), this will work:
I'm reasonably sure you can adapt it to work on other cases, but the recursion into the matched string means that overlapping matches (like the one pointed out by @ahenderson) won't work.
You can use the reluctant quantifiers such as *? and +?. These match as little as possible, in contrast to the standard * and +, which are greedy, i.e. match as much as possible. Still, this only allows you to find particular "sub-matches", not all of them. Some more control can be achieved using lookahead and non-capturing groups, also described in the docs. But in order to really find all sub-matches, you would probably have to do the work yourself, i.e. build the automaton to which the regex corresponds and navigate it using custom code.
I don't know of any regex engines that can give back all valid matches.
But we can apply a bit of logic to generate all candidate strings and present them to the regex.
A candidate is constructed by enumerating all possible substrings of a given input.
This gives
which seems to be the correct answer but contradicts your result:
For example, the regex that you give
does not match the first entry in your result
The logic above can generate identical matches if you consider the starting position unimportant. For example, if you just repeat the given input another time:
It will give:
If you consider position unimportant, you should remove the duplicates from this result.
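As a rough sketch of that deduplication step: the class name `DistinctSubstringMatches`, the helper `findDistinct`, and the way the input is repeated (joined with a single space) are all assumptions made for illustration. It uses `Matcher.region` to test each candidate span in place and a `LinkedHashSet` to keep each matching text once, in first-seen order.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DistinctSubstringMatches {

    // Hypothetical helper: collects each matching substring once,
    // ignoring where in the input it was found.
    static Set<String> findDistinct(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        Set<String> distinct = new LinkedHashSet<>();
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                m.region(start, end);          // restrict matching to [start, end)
                if (m.matches()) {             // the whole region must match
                    distinct.add(m.group());   // the set drops repeated texts
                }
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        String once = "y z a a a b c c z";
        // Assumption: "repeating the input" means joining two copies with a space.
        String twice = once + " " + once;
        findDistinct(twice, "(a )+(b )+(c *)c").forEach(System.out::println);
    }
}
```

If the position does matter, storing the `start` and `end` indices alongside the text instead of using a set would keep the two occurrences apart.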
The trivial case where the input is the empty string should also be added if it is considered a potential match.
FYI, these are all the candidates that the regex examines
Also, it's good to know how the two main types of regex engines (NFA and DFA) do their work
from http://msdn.microsoft.com/en-us/library/e347654k.aspx
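To make the engine distinction a little more concrete for Java: `java.util.regex` is a backtracking (NFA-style) engine, so alternatives are tried in the order written and the first one that succeeds wins. The small sketch below (the class name `AlternationOrder` and helper `firstMatch` are hypothetical) shows that the reported match depends on that order, whereas a leftmost-longest, DFA-style engine would be expected to report "ab" for both patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrder {

    // Hypothetical helper: first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The first alternative that succeeds is taken, even if a longer one exists.
        System.out.println(firstMatch("a|ab", "ab")); // prints "a"
        System.out.println(firstMatch("ab|a", "ab")); // prints "ab"
    }
}
```

Either way, the engine commits to one match per starting position, which is why it never enumerates all sub-matches on its own.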
You will need a lazy quantifier.
Please try the following:
Please also note that I grouped "c" once again, since I think that's what you want. Otherwise you would find arbitrarily many spaces, but not "c".
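The pattern from this answer is not reproduced here, so the following is only a sketch of what a lazy quantifier changes. Both patterns are assumptions built from the question's regex with "c " grouped, and the class name `GreedyVsReluctant` and helper `firstMatch` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    // Hypothetical helper: first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "y z a a a b c c z";
        // Greedy (c )* takes as many "c " as it can before the final c.
        System.out.println(firstMatch("(a )+(b )+(c )*c", s));  // prints "a a a b c c"
        // Reluctant (c )*? takes as few "c " as it can before the final c.
        System.out.println(firstMatch("(a )+(b )+(c )*?c", s)); // prints "a a a b c"
    }
}
```

The lazy variant yields a shorter match, but `find()` still reports a single match starting at the leftmost possible position, so substrings such as "a b c" are not returned this way.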