Expected outcome in group capture?

Posted 2019-04-15 08:21

Question:

    String line = "This order was placed for QT3000! OK?";
    String pattern = "(.*)(\\d+)(.*)";

    // Create a Pattern object
    Pattern r = Pattern.compile(pattern);

    // Now create matcher object.
    Matcher m = r.matcher(line);
    if (m.find()) {
      System.out.println("Found value: " + m.group(1));
      System.out.println("Found value: " + m.group(2));
      System.out.println("Found value: " + m.group(3));
    }

The output is:

Found value: This order was placed for QT300
Found value: 0
Found value: ! OK?

However, I was expecting the output to be:

Found value: This order was placed for QT3000! OK?
Found value: 3000
Found value: This order was placed for QT3000! OK?

My reasoning for the expected output is:

If the pattern is "(.*)", the output for m.group(1) is "This order was placed for QT3000! OK?"
If the pattern is "(\\d+)", the output for m.group(1) is "3000"

Why don't I get the expected output when the pattern is "(.*)(\\d+)(.*)"?

Answer 1:

This happens because the first (.*) is greedy: it consumes as much as possible while still allowing (\d+)(.*) to match the rest of the string.

Basically, the match goes like this. At the beginning, the first .* will gobble up the whole string:

This order was placed for QT3000! OK?
                                     ^

However, since we can't find a match for \d+ here, we backtrack:

This order was placed for QT3000! OK?
                                    ^
This order was placed for QT3000! OK?
                                   ^
...

This order was placed for QT3000! OK?
                               ^

At this position, \d+ can be matched, so we proceed:

This order was placed for QT3000! OK?
                                ^

and the final .* will match the rest of the string, "! OK?".

That's the explanation for the output you see.
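
To check the walkthrough above against the real engine, here is a minimal sketch that prints where each group starts and ends for the greedy pattern. The class name GreedySpans and the helper spans are made up for this illustration; only Pattern, Matcher, start(int), end(int) and group(int) come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedySpans {
    // Hypothetical helper: returns "start-end:text" for each capture group
    // of the greedy pattern from the question, or an empty array if no match.
    static String[] spans(String line) {
        Matcher m = Pattern.compile("(.*)(\\d+)(.*)").matcher(line);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            // start(i) is inclusive, end(i) is exclusive (0-based offsets)
            out[i - 1] = m.start(i) + "-" + m.end(i) + ":" + m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : spans("This order was placed for QT3000! OK?")) {
            System.out.println(s);
        }
        // prints:
        // 0-31:This order was placed for QT300
        // 31-32:0
        // 32-37:! OK?
    }
}
```

Group 1 ends at offset 31, which is the caret position where \d+ first succeeds during backtracking.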


You can fix this problem by making the first (.*) lazy:

(.*?)(\d+)(.*)

The search for a match for (.*?) begins with the empty string, and each time the rest of the pattern fails to match, it gobbles up one more character:

This order was placed for QT3000! OK?
^
This order was placed for QT3000! OK?
 ^
...

This order was placed for QT3000! OK?
                            ^

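At this point, \d+ can be matched (it greedily takes all of 3000), and the final .* matches "! OK?", which finishes the matching attempt. Group 2 is now 3000 as you expected; groups 1 and 3 are "This order was placed for QT" and "! OK?", because the three groups split one match between them and cannot each capture the whole string.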
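
As a quick comparison, the sketch below runs both the greedy and the lazy pattern over the same line. The class name LazyGroups and the helper groups are assumptions made for this example, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyGroups {
    // Hypothetical helper: returns the text of groups 1..n for the first
    // match of regex in line, or an empty array if there is no match.
    static String[] groups(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        // Greedy first group: leaves only one digit for \d+
        System.out.println(String.join(" | ", groups("(.*)(\\d+)(.*)", line)));
        // prints: This order was placed for QT300 | 0 | ! OK?

        // Lazy first group: stops as soon as \d+ can match
        System.out.println(String.join(" | ", groups("(.*?)(\\d+)(.*)", line)));
        // prints: This order was placed for QT | 3000 | ! OK?
    }
}
```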



Answer 2:

The .* matches (and consumes) as many characters as it can before \\d+ gets a chance to match. When the engine gets to \\d+, a single digit is enough for it to match.

So, you need to make the .* lazy:

(.*?)(\\d+)(.*)

If you want to go into the details: .* first matches the whole string, then backtracks one character at a time until the rest of the regex, (\\d+)(.*), can also match. Once it has backtracked to the point where it matches only this:

This order was placed for QT300

The rest of the regex ((\\d+)(.*)) is satisfied so the matching ends.
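
The backtracking described above can be imitated by hand. The sketch below is a simplification, not how java.util.regex is implemented: it assumes a single-line input, tries every split point from the end of the string backwards, and stops at the first one where (\\d+)(.*) matches the remainder. The names BacktrackSketch and firstSplit are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BacktrackSketch {
    // Hypothetical helper: returns the largest index k such that
    // (\d+)(.*) matches starting exactly at k, or -1 if there is none.
    // This mimics a greedy .* giving back one character at a time.
    static int firstSplit(String line) {
        Pattern rest = Pattern.compile("(\\d+)(.*)");
        for (int k = line.length(); k >= 0; k--) {
            // Restrict the matcher to line[k, length) and anchor at k
            Matcher m = rest.matcher(line).region(k, line.length());
            if (m.lookingAt()) {
                return k;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        int k = firstSplit(line);
        System.out.println(k);                    // prints: 31
        System.out.println(line.substring(0, k)); // prints: This order was placed for QT300
    }
}
```

The first split point that works is the last digit, which is why the greedy group 1 keeps "QT300" and \\d+ is left with only "0".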