TL;DR
What are the design decisions behind Matcher's API?
Background
Matcher has a behaviour that I didn't expect and for which I can't find a good reason. The API documentation says:
Once created, a matcher can be used to perform three different kinds of match operations: [...] Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
What the API documentation further says is:
The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
Example
String s = "foo=23,bar=42";
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
Matcher matcher = p.matcher(s);
System.out.println(matcher.group("foo")); // (1)
System.out.println(matcher.group("bar"));
This code throws a java.lang.IllegalStateException: No match found at (1). To get around this, it is necessary to call matches() or another method that brings the Matcher into a state that allows group(). The following works:
String s = "foo=23,bar=42";
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
Matcher matcher = p.matcher(s);
matcher.matches(); // (2)
System.out.println(matcher.group("foo"));
System.out.println(matcher.group("bar"));
Adding the call to matches() at (2) puts the Matcher into the proper state to call group().
Question, probably not constructive
Why is this API designed like this? Why not automatically match when the Matcher is built with Pattern.matcher(String)?
My guess is the design decision was based on having queries that had clear, well defined semantics that didn't conflate existence with match properties.
Consider this: what would you expect Matcher queries to return if the matcher has not successfully matched something?
Let's first consider group(). If we haven't successfully matched something, Matcher shouldn't return the empty string, as it hasn't matched the empty string. We could return null at this point.
Ok, now let's consider start() and end(). Each returns int. What int value would be valid in this case? Certainly no non-negative number, since any of those could be a real index. What negative number would be appropriate? -1?
Given all this, a user is still going to have to check return values for every query to verify whether a match occurred or not. Alternatively, you could check outright whether it matches successfully, and if it does, the query semantics all have well-defined meaning. If not, the user gets consistent behaviour no matter which angle is queried.
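To make the "consistent behaviour" point concrete, here is a minimal sketch, assuming a pattern and input chosen only for illustration. The class name FailedMatchQueries and the helper throwsIllegalState are hypothetical. It checks that group(), start() and end() all fail the same way after an unsuccessful matches().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FailedMatchQueries {
    // Hypothetical helper: runs a query and reports whether it threw IllegalStateException.
    static boolean throwsIllegalState(Runnable query) {
        try {
            query.run();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Illustrative pattern and input; the input deliberately does not match.
        Matcher m = Pattern.compile("foo=([0-9]+)").matcher("bar=42");
        boolean matched = m.matches();
        System.out.println("matched: " + matched);

        // Every state query behaves the same way: no sentinel values to remember.
        System.out.println("group() throws: " + throwsIllegalState(() -> m.group()));
        System.out.println("start() throws: " + throwsIllegalState(() -> m.start()));
        System.out.println("end() throws: " + throwsIllegalState(() -> m.end()));
    }
}
```

With sentinel return values instead, each of these three queries would need its own check (null for group(), some negative number for start() and end()).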
I'll grant that re-using IllegalStateException may not have resulted in the best description of the error condition. But if we were to rename/subclass IllegalStateException to NoSuccessfulMatchException, one should be able to appreciate how the current design enforces query consistency and encourages the user to use queries that have semantics that are known to be defined at the time of asking.
TL;DR: What is the value of asking the specific cause of death of a living organism?
Actually, you misunderstood the documentation. Take a second look at the statement you quoted:
A matcher may throw IllegalStateException on accessing matcher.group() if no match was found.
So you need to call one of the matching methods first, to actually initiate the matching process.
Calling Pattern.matcher(...) just creates a Matcher instance. It does not actually match the string, even if the string would match. So you need to check the boolean returned by a matching method to know whether the match succeeded.
If that check returns false, nothing was matched. So if you use matcher.group() without checking this condition, you will get an IllegalStateException if no match was found.
Suppose Matcher were designed the way you suggest. Then you would have to do a null check on the result of matcher.group() to find out whether a match was found.
But what if you want to print any further matches? A pattern can match multiple times in a String, so there has to be a way to tell the matcher to find the next match. A null check cannot do that; you have to move the matcher forward to the next match. The Matcher class defines various methods for this purpose. The matcher.find() method can be called repeatedly until all the matches have been found.
There are also other methods that match the string in different ways, depending on how you want to match. So it is ultimately up to the Matcher class to do the matching against the string. The Pattern class just creates a pattern to match against. If Pattern.matcher() were to match the pattern itself, there would have to be some way to specify which kind of match to perform, as matching can be done in different ways. Hence the need for the Matcher class.
So the way it actually works is that you call matcher.find() in a loop and read matcher.group() after each successful call.
So if there are 4 matches in the string, your way would print only the first one, while the second way prints all the matches, by moving the matcher forward to the next match.
I hope that makes it clear.
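As a minimal sketch of the looping approach described above, with an input that happens to contain four matches: the class name FindAllMatches, the method allMatches, the pattern and the input string are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllMatches {
    // Collects every match by repeatedly advancing the matcher with find().
    static List<String> allMatches(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {          // false once no further match exists
            result.add(matcher.group()); // safe: find() just succeeded
        }
        return result;
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("[0-9]+");
        // prints [1, 22, 333, 4444]
        System.out.println(allMatches(digits, "a=1,b=22,c=333,d=4444"));
    }
}
```

A matcher that matched automatically on construction and exposed only group() would have no natural way to express "advance to the next match"; here the boolean returned by find() serves as both the success check and the loop condition.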
The documentation of the Matcher class describes the use of the three methods it provides: matches(), lookingAt() and find().
Unfortunately, I have not been able to find any other official source that explicitly explains the why and how of this issue.
My answer is very similar to Rohit Jain's but includes some reasons why the 'extra' step is necessary.
java.util.regex implementation
The call to Pattern.compile(...) causes a new Pattern object to be allocated, which internally stores a structure representing the RE: information such as a choice of characters, groups, sequences, greedy vs. non-greedy, repeats and so on.
This pattern is stateless and immutable, so it can be reused, can be shared between threads and optimizes well.
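A minimal sketch of what that reuse can look like, assuming an illustrative pattern: the class name SharedPattern and the method valueOf are hypothetical. The Pattern is compiled once and shared, while each call creates its own Matcher, because the Matcher is the part that holds mutable state.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SharedPattern {
    // Compiled once; Pattern instances are immutable and can be shared freely.
    private static final Pattern ASSIGNMENT = Pattern.compile("([a-z]+)=([0-9]+)");

    // Each call gets a fresh Matcher, which carries the per-input match state.
    static Optional<String> valueOf(String input) {
        Matcher m = ASSIGNMENT.matcher(input);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(valueOf("foo=23")); // prints Optional[23]
        System.out.println(valueOf("bar=42")); // prints Optional[42]
        System.out.println(valueOf("baz"));    // prints Optional.empty
    }
}
```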
The call to Pattern.matcher(...) returns a new Matcher object for the Pattern and String, one that has not yet read the String.
Matcher is really just a state machine's state, where the state machine is the Pattern.
The matching can be run by stepping the state machine through the matching process using the following API:
lookingAt(): Attempts to match the input sequence, starting at the beginning, against the pattern.
find(): Scans the input sequence looking for the next subsequence that matches the pattern.
In both cases, the intermediate state can be read using the start(), end(), and group() methods.
Benefits of this approach
Why would anyone want to step through the parsing?
To get values from groups that have quantification greater than 1 (i.e. groups that repeat and end up matching more than once), for example in a trivial RE that parses variable assignments.
See the section on "Group name" in "Groups and capturing" in the JavaDoc on Pattern.
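One possible reading of this point, as a sketch: a repeated capturing group only retains the text of its most recent iteration, so a single matches() call loses the earlier values, whereas stepping with find() exposes each one in turn. The two regular expressions, the class name StepThroughAssignments and both method names are assumptions for illustration; the input is the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StepThroughAssignments {
    // One-shot match with a repeated group: only the last iteration is retained.
    static String lastRepeatedName(String input) {
        Matcher m = Pattern.compile("(?:([a-z]+)=([0-9]+),?)+").matcher(input);
        return m.matches() ? m.group(1) : "";
    }

    // Stepping with find(): every assignment is visible, including its position.
    static List<String> stepThrough(String input) {
        List<String> steps = new ArrayList<>();
        Matcher m = Pattern.compile("([a-z]+)=([0-9]+)").matcher(input);
        while (m.find()) {
            steps.add(m.group(1) + "=" + m.group(2)
                    + " at [" + m.start() + "," + m.end() + ")");
        }
        return steps;
    }

    public static void main(String[] args) {
        String s = "foo=23,bar=42";
        System.out.println(lastRepeatedName(s)); // prints bar
        System.out.println(stepThrough(s));      // prints [foo=23 at [0,6), bar=42 at [7,13)]
    }
}
```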
However, on most occasions you do not need to step the state machine through the matching, so there is a convenience method (matches()) which runs the pattern matching to completion.
If a matcher automatically matched the input string, that would be wasted effort in case you wish to find the pattern instead.
A matcher can be used to check whether the pattern matches() the input string, and it can be used to find() the pattern in the input string (even repeatedly, to find all matching substrings). Until you call one of these two methods, the matcher does not know what test you want to perform, so it cannot give you any matched groups. Even if you do call one of these methods, the call may fail (the pattern is not found), and in that case a call to group() must fail as well.
You need to check the return value of matcher.matches(). It will return true when a match was found, false otherwise.
If matcher.matches() does not find a match and you call matcher.group(...), you'll still get an IllegalStateException. That's exactly what the documentation says: querying the matcher's state before a successful match throws an IllegalStateException.
When matcher.matches() returns false, no successful match has been found and it doesn't make a lot of sense to get information on the match by calling, for example, group().
This is expected and documented.
The reason is that .matches() returns a boolean indicating whether there was a match. If there was a match, then you can call .group(...) meaningfully. Otherwise, if there's no match, a call to .group(...) makes no sense. Therefore, you should not be allowed to call .group(...) before calling matches().
The correct way to use a matcher is to check the return value of matches() and call .group(...) only when it is true.
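A minimal sketch of that guarded usage, reusing the pattern and input from the question; the class name GuardedMatch and the method describe are illustrative, not part of any answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuardedMatch {
    private static final Pattern P =
            Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");

    // Queries the groups only after matches() has reported success.
    static String describe(String input) {
        Matcher matcher = P.matcher(input);
        if (matcher.matches()) {
            return "foo=" + matcher.group("foo") + ", bar=" + matcher.group("bar");
        }
        return "no match"; // group() is never called on this path
    }

    public static void main(String[] args) {
        System.out.println(describe("foo=23,bar=42")); // prints foo=23, bar=42
        System.out.println(describe("baz"));           // prints no match
    }
}
```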