Finding all of the matching substrings, not only t

2020-02-01 01:10发布

The code

String s = "y z a a a b c c z";
Pattern p = Pattern.compile("(a )+(b )+(c *)c");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}

prints

a a a b c c

which is right.

But logically, the substrings

a a a b c
a a b c c
a a b c
a b c c
a b c

match the regex too.

So, how can I make the code find those substrings too, i.e. not only the most extended one, but also its children?

5条回答
男人必须洒脱
2楼-- · 2020-02-01 01:32

The only way I can think of here would be to generate a list of all possible substrings of your original string and match the regex against each of them, retaining those items where it matched.

查看更多
做自己的国王
3楼-- · 2020-02-01 01:45

Given these very specific constraints (i.e. this is not a general case solution), this will work:

import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {

    public static void main(String[] args) {

        String s = "y z a a a b c c z";

        Pattern p = Pattern.compile("(a )+(b )+(c ?)+");
        Set<String> set = recurse(s, p, 0);
    }

    public static Set<String> recurse(String s, Pattern p, int depth) {
        int temp = depth;
        while(temp>0) {
            System.out.print("  ");
            temp--;
        }
        System.out.println("-> " +s);

        Matcher matcher = p.matcher(s);
        Set<String> set = new TreeSet<String>();

        if(matcher.find()) {
            String found = matcher.group().trim();
            set.add(found);
            set.addAll(recurse(found.substring(1), p, depth+1));
            set.addAll(recurse(found.substring(0, found.length()-1), p, depth+1));
        }

        while(depth>0) {
            System.out.print("  ");
            depth--;
        }
        System.out.println("<- " +s);
        return set;
    }
}

I'm reasonably sure you can adapt it to work on other cases, but the recursion into the matched string means that overlapping matches (like the one pointed out by @ahenderson) won't work.

查看更多
放我归山
4楼-- · 2020-02-01 01:47

You can use the reluctant qualifiers such as *? and +?. These match as little as possible, in contrast to the standard * and + which are greedy, i.e. match as much as possible. Still, this only allows you to find particular "sub-matches", not all of them. Some more control can be achieved using lookahead controlling non-capturing groups, also described in the docs. But in order to really find all sub-matches, you would probably have to do stuff yourself, i.e. build the automaton to which the regex corresponds and navigate it using custom code.

查看更多
Animai°情兽
5楼-- · 2020-02-01 01:54

I don't know of any regex engines that can give back all valid matches.

But we can apply a bit of logic to generate all candidates string and present it to the regex.

A candidate is constructed by enumerating all possible substring of a given input.

var str = "y z a a a b c c z y z a a a b c c z";
var regex = new Regex("(a )+(b )+(c *)c");

var length = str.Length;

for (int start = 1; start <= length;start++){

    for (int groupLength = 1;  start + groupLength - 1 <= length ;groupLength++){

        var candidate = str.Substring(start-1,groupLength); //.Dump();

        //("\"" + candidate + "\"").Dump();

        var match = regex.Match(candidate);

        if (match.Value == candidate )
        {
            candidate.Dump();
        }

    }
}

This gives

a a a b c c 
a a b c c 
a b c c 

which seems the correct answer but contradicts your result :

a a a b c => I state that this is not a match
a a b c c ok
a a b c => I state that this is not a match
a b c c ok
a b c => I state that this is not a match

For example, the regex that you give

(a )+(b )+(c *)c

does not match the first entry in your result

a a a b c 

The logic above can generate identical matches if you consider starting position not important. For example if you just repeat the given input another time :

"y z a a a b c c z y z a a a b c c z"

It will give :

a a a b c c
a a b c c
a b c c
a a a b c c
a a b c c
a b c c

If you consider position not important you should do a distinct on this result

The trivial case where the input is the empty string should als be added if considered a potential match.

FYI, this are all the candidates that the regex examines

"y"
"y "
"y z"
"y z "
"y z a"
"y z a "
"y z a a"
"y z a a "
"y z a a a"
"y z a a a "
"y z a a a b"
"y z a a a b "
"y z a a a b c"
"y z a a a b c "
"y z a a a b c c"
"y z a a a b c c "
"y z a a a b c c z"
" "
" z"
" z "
" z a"
" z a "
" z a a"
" z a a "
" z a a a"
" z a a a "
" z a a a b"
" z a a a b "
" z a a a b c"
" z a a a b c "
" z a a a b c c"
" z a a a b c c "
" z a a a b c c z"
"z"
"z "
"z a"
"z a "
"z a a"
"z a a "
"z a a a"
"z a a a "
"z a a a b"
"z a a a b "
"z a a a b c"
"z a a a b c "
"z a a a b c c"
"z a a a b c c "
"z a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a a"
" a a a "
" a a a b"
" a a a b "
" a a a b c"
" a a a b c "
" a a a b c c"
" a a a b c c "
" a a a b c c z"
"a"
"a "
"a a"
"a a "
"a a a"
"a a a "
"a a a b"
"a a a b "
"a a a b c"
"a a a b c "
"a a a b c c"
"a a a b c c "
"a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a b"
" a a b "
" a a b c"
" a a b c "
" a a b c c"
" a a b c c "
" a a b c c z"
"a"
"a "
"a a"
"a a "
"a a b"
"a a b "
"a a b c"
"a a b c "
"a a b c c"
"a a b c c "
"a a b c c z"
" "
" a"
" a "
" a b"
" a b "
" a b c"
" a b c "
" a b c c"
" a b c c "
" a b c c z"
"a"
"a "
"a b"
"a b "
"a b c"
"a b c "
"a b c c"
"a b c c "
"a b c c z"
" "
" b"
" b "
" b c"
" b c "
" b c c"
" b c c "
" b c c z"
"b"
"b "
"b c"
"b c "
"b c c"
"b c c "
"b c c z"
" "
" c"
" c "
" c c"
" c c "
" c c z"
"c"
"c "
"c c"
"c c "
"c c z"
" "
" c"
" c "
" c z"
"c"
"c "
"c z"
" "
" z"
"z"

Also it's good to know how the 2 main types of regexes (NFA and DFA) do their work

from http://msdn.microsoft.com/en-us/library/e347654k.aspx

.NET (and JAVA too I think) are NFA regex engines (as opposed to DFA) and as it processes a particular language element, the engine uses greedy matching; that is, it matches as much of the input string as it possibly can. But it also saves its state after successfully matching a subexpression. If a match eventually fails, the engine can return to a saved state so it can try additional matches. This process of abandoning a successful subexpression match so that later language elements in the regular expression can also match is known as backtracking.

查看更多
唯我独甜
6楼-- · 2020-02-01 01:58

You will need a lazy quantifier.

Please try the following:

Pattern p = Pattern.compile("(a )+(b )+((c )*?)c");

Please also notice, that I grouped "c" once again, since I think that's what you want. Otherwise you would find arbitrarily many spaces, but not "c".

查看更多
登录 后发表回答