import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "This order was placed for QT3000! OK?";
String pattern = "(.*)(\\d+)(.*)";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create a Matcher object
Matcher m = r.matcher(line);
if (m.find()) {
    System.out.println("Found value: " + m.group(1));
    System.out.println("Found value: " + m.group(2));
    System.out.println("Found value: " + m.group(3));
}
The output is:
Found value: This order was placed for QT300
Found value: 0
Found value: ! OK?
However, I was expecting the output to be:
Found value: This order was placed for QT3000! OK?
Found value: 3000
Found value: This order was placed for QT3000! OK?
The reason for my expected output is:
If the pattern is "(.*)", m.group(1) is "This order was placed for QT3000! OK?"
If the pattern is "(\\d+)", m.group(1) is "3000"
So why don't I get the expected output when the pattern is "(.*)(\\d+)(.*)"?
This happens because the first (.*) is greedy: it eats up as much as possible while still allowing (\d+)(.*) to match the rest of the string.
Basically, the match goes like this. At the beginning, the first .* gobbles up the whole string:
This order was placed for QT3000! OK?
                                     ^
However, since we can't find a match for \d+ here, we backtrack:
This order was placed for QT3000! OK?
                                    ^
This order was placed for QT3000! OK?
                                   ^
...
This order was placed for QT3000! OK?
                               ^
At this position, \d+ can be matched, so we proceed:
This order was placed for QT3000! OK?
                                ^
and .* will match the rest of the string.
That's the explanation for the output you see.
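To make the walk-through concrete, here is a minimal sketch that imitates the backtracking by hand. It is a simplified model, not how java.util.regex is implemented internally: it assumes the input has no line terminators (so .* can cover any prefix) and simply tries the longest prefix first, giving back one character at a time until (\d+)(.*) matches at the split point. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyBacktrackSketch {
    // The part of the original pattern that follows the first (.*).
    private static final Pattern REST = Pattern.compile("(\\d+)(.*)");

    /**
     * Returns how many characters a greedy first (.*) would keep:
     * start with the whole string and give back one character at a time
     * until REST matches at the split point. Returns -1 if no split works.
     */
    static int greedySplit(String line) {
        for (int split = line.length(); split >= 0; split--) {
            Matcher rest = REST.matcher(line);
            rest.region(split, line.length());
            // lookingAt() requires the match to start exactly at the region start.
            if (rest.lookingAt()) {
                return split;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        int split = greedySplit(line);
        // Prints: Group 1 would be: This order was placed for QT300
        System.out.println("Group 1 would be: " + line.substring(0, split));
    }
}
```

The loop stops at the first split where a digit follows, which is just before the last 0, so only that single digit is left for \d+.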
You can fix this problem by making the first (.*) lazy:
(.*?)(\d+)(.*)
The search for a match for (.*?) begins with the empty string, and as it backtracks, it gradually increases the number of characters it gobbles up:
This order was placed for QT3000! OK?
^
This order was placed for QT3000! OK?
 ^
...
This order was placed for QT3000! OK?
                            ^
At this point, \d+ can be matched, and .* can also be matched, which finishes the matching attempt. Group 2 is then 3000, as you expected, while groups 1 and 3 hold the text before and after it.
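If you want to check the lazy version yourself, a small test like the following should do. The helper name groups is just something I made up for this sketch; the pattern and input are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyGroupDemo {
    /** Returns groups 1 to 3 of the first match, or null if nothing matches. */
    static String[] groups(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        // Same pattern as in the question, but with a lazy first group.
        for (String g : groups("(.*?)(\\d+)(.*)", line)) {
            System.out.println("Found value: " + g);
        }
        // Prints:
        // Found value: This order was placed for QT
        // Found value: 3000
        // Found value: ! OK?
    }
}
```

Note that \d+ itself is still greedy, which is why it takes all four digits once the lazy group stops in front of the 3.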
The .* matches (and consumes) as many characters as it can while still leaving something for \\d+ to match. By the time it gets to \\d+, a single digit is enough for a match.
So, you need to make the .* lazy:
(.*?)(\\d+)(.*)
Well, if you want to go into the details, .* first matches the whole string, then backtracks one character at a time so that the regex can also match the (\\d+)(.*) that comes after it. Once it has backtracked to the point where it covers only:
This order was placed for QT300
the rest of the regex, (\\d+)(.*), is satisfied, so the matching ends.
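One way to see where each group ends is to ask the Matcher for the group offsets with start(int) and end(int). This is only a quick sketch; the bounds helper is a name I invented, and the offsets in the comments are 0-based indexes into the string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyGroupBounds {
    /** Returns {end of group 1, start of group 2, end of group 2} for the first match. */
    static int[] bounds(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("no match");
        }
        return new int[] { m.end(1), m.start(2), m.end(2) };
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        int[] greedy = bounds("(.*)(\\d+)(.*)", line);
        int[] lazy = bounds("(.*?)(\\d+)(.*)", line);
        // Greedy: group 1 ends at 31, so group 2 is only the last digit (31 to 32).
        System.out.println("greedy: " + greedy[0] + " " + greedy[1] + " " + greedy[2]);
        // Lazy: group 1 ends at 28, so group 2 covers all four digits (28 to 32).
        System.out.println("lazy:   " + lazy[0] + " " + lazy[1] + " " + lazy[2]);
    }
}
```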