The code
String s = "y z a a a b c c z";
Pattern p = Pattern.compile("(a )+(b )+(c *)c");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
prints
a a a b c c
which is right.
But logically, the substrings
a a a b c
a a b c c
a a b c
a b c c
a b c
match the regex too.
So, how can I make the code find those substrings too, i.e. not only the most extended one, but also its children?
You can use the reluctant qualifiers such as *?
and +?
. These match as little as possible, in contrast to the standard *
and +
which are greedy, i.e. match as much as possible. Still, this only allows you to find particular "sub-matches", not all of them. Some more control can be achieved using lookahead controlling non-capturing groups, also described in the docs. But in order to really find all sub-matches, you would probably have to do stuff yourself, i.e. build the automaton to which the regex corresponds and navigate it using custom code.
You will need a lazy quantifier.
Please try the following:
Pattern p = Pattern.compile("(a )+(b )+((c )*?)c");
Please also notice, that I grouped "c
" once again, since I think that's what you want. Otherwise you would find arbitrarily many spaces, but not "c
".
The only way I can think of here would be to generate a list of all possible substrings of your original string and match the regex against each of them, retaining those items where it matched.
I don't know of any regex engines that can give back all valid matches.
But we can apply a bit of logic to generate all candidates string and present it to the regex.
A candidate is constructed by enumerating all possible substring of a given input.
var str = "y z a a a b c c z y z a a a b c c z";
var regex = new Regex("(a )+(b )+(c *)c");
var length = str.Length;
for (int start = 1; start <= length;start++){
for (int groupLength = 1; start + groupLength - 1 <= length ;groupLength++){
var candidate = str.Substring(start-1,groupLength); //.Dump();
//("\"" + candidate + "\"").Dump();
var match = regex.Match(candidate);
if (match.Value == candidate )
{
candidate.Dump();
}
}
}
This gives
a a a b c c
a a b c c
a b c c
which seems the correct answer but contradicts your result :
a a a b c => I state that this is not a match
a a b c c ok
a a b c => I state that this is not a match
a b c c ok
a b c => I state that this is not a match
For example, the regex that you give
(a )+(b )+(c *)c
does not match the first entry in your result
a a a b c
The logic above can generate identical matches if you consider starting position not important. For example if you just repeat the given input another time :
"y z a a a b c c z y z a a a b c c z"
It will give :
a a a b c c
a a b c c
a b c c
a a a b c c
a a b c c
a b c c
If you consider position not important you should do a distinct on this result
The trivial case where the input is the empty string should als be added if considered a potential match.
FYI, this are all the candidates that the regex examines
"y"
"y "
"y z"
"y z "
"y z a"
"y z a "
"y z a a"
"y z a a "
"y z a a a"
"y z a a a "
"y z a a a b"
"y z a a a b "
"y z a a a b c"
"y z a a a b c "
"y z a a a b c c"
"y z a a a b c c "
"y z a a a b c c z"
" "
" z"
" z "
" z a"
" z a "
" z a a"
" z a a "
" z a a a"
" z a a a "
" z a a a b"
" z a a a b "
" z a a a b c"
" z a a a b c "
" z a a a b c c"
" z a a a b c c "
" z a a a b c c z"
"z"
"z "
"z a"
"z a "
"z a a"
"z a a "
"z a a a"
"z a a a "
"z a a a b"
"z a a a b "
"z a a a b c"
"z a a a b c "
"z a a a b c c"
"z a a a b c c "
"z a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a a"
" a a a "
" a a a b"
" a a a b "
" a a a b c"
" a a a b c "
" a a a b c c"
" a a a b c c "
" a a a b c c z"
"a"
"a "
"a a"
"a a "
"a a a"
"a a a "
"a a a b"
"a a a b "
"a a a b c"
"a a a b c "
"a a a b c c"
"a a a b c c "
"a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a b"
" a a b "
" a a b c"
" a a b c "
" a a b c c"
" a a b c c "
" a a b c c z"
"a"
"a "
"a a"
"a a "
"a a b"
"a a b "
"a a b c"
"a a b c "
"a a b c c"
"a a b c c "
"a a b c c z"
" "
" a"
" a "
" a b"
" a b "
" a b c"
" a b c "
" a b c c"
" a b c c "
" a b c c z"
"a"
"a "
"a b"
"a b "
"a b c"
"a b c "
"a b c c"
"a b c c "
"a b c c z"
" "
" b"
" b "
" b c"
" b c "
" b c c"
" b c c "
" b c c z"
"b"
"b "
"b c"
"b c "
"b c c"
"b c c "
"b c c z"
" "
" c"
" c "
" c c"
" c c "
" c c z"
"c"
"c "
"c c"
"c c "
"c c z"
" "
" c"
" c "
" c z"
"c"
"c "
"c z"
" "
" z"
"z"
Also it's good to know how the 2 main types of regexes (NFA and DFA) do their work
from http://msdn.microsoft.com/en-us/library/e347654k.aspx
.NET (and JAVA too I think) are NFA regex engines (as opposed to DFA)
and as it processes a particular language element, the engine uses
greedy matching; that is, it matches as much of the input string as it
possibly can. But it also saves its state after successfully matching
a subexpression. If a match eventually fails, the engine can return to
a saved state so it can try additional matches. This process of
abandoning a successful subexpression match so that later language
elements in the regular expression can also match is known as
backtracking.
Given these very specific constraints (i.e. this is not a general case solution), this will work:
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test {
public static void main(String[] args) {
String s = "y z a a a b c c z";
Pattern p = Pattern.compile("(a )+(b )+(c ?)+");
Set<String> set = recurse(s, p, 0);
}
public static Set<String> recurse(String s, Pattern p, int depth) {
int temp = depth;
while(temp>0) {
System.out.print(" ");
temp--;
}
System.out.println("-> " +s);
Matcher matcher = p.matcher(s);
Set<String> set = new TreeSet<String>();
if(matcher.find()) {
String found = matcher.group().trim();
set.add(found);
set.addAll(recurse(found.substring(1), p, depth+1));
set.addAll(recurse(found.substring(0, found.length()-1), p, depth+1));
}
while(depth>0) {
System.out.print(" ");
depth--;
}
System.out.println("<- " +s);
return set;
}
}
I'm reasonably sure you can adapt it to work on other cases, but the recursion into the matched string means that overlapping matches (like the one pointed out by @ahenderson) won't work.