Is there an API method that returns all (possibly overlapping) substrings that match a regular expression?
For example, I have a text string:
String t = "04/31 412-555-1235";
and a pattern:
Pattern p = Pattern.compile("\\d\\d+");
that matches strings of two or more digits.
The matches I get are: 04, 31, 412, 555, 1235.
How do I get overlapping matches?
I want the code to return: 04, 31, 41, 412, 12, 55, 555, 55, 12, 123, 1235, 23, 235, 35.
Theoretically it should be possible -- there is an obvious O(n^2)
algorithm that enumerates and checks all the substrings against the pattern.
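To make the brute-force idea concrete, here is a minimal sketch that tries every non-empty substring against the pattern. The class name OverlappingMatches and the helper allSubstringMatches are made up for illustration. Note that it matches each extracted substring in isolation, so patterns with boundary or lookaround constructs at their edges may behave differently than they would inside the full text. There are O(n^2) candidate substrings, and each check has its own matching cost on top of that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class OverlappingMatches {
    // Hypothetical helper: returns every non-empty substring of text
    // that the regex matches in full, ordered by start, then by end.
    static List<String> allSubstringMatches(String text, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                String candidate = text.substring(i, j);
                // matches() requires the whole candidate to match, not just a part of it.
                if (p.matcher(candidate).matches()) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [04, 31, 41, 412, 12, 55, 555, 55, 12, 123, 1235, 23, 235, 35]
        System.out.println(allSubstringMatches("04/31 412-555-1235", "\\d\\d+"));
    }
}
```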
EDIT
Rather than enumerating all substrings, it is safer to use the region(int start, int end) method in Matcher. Checking the pattern against a separate, extracted substring might change the result of the match (e.g. if there is a lookaround or word-boundary check at the start/end of the pattern).
EDIT 2
Actually, it's unclear whether region() does what you expect for zero-width matches. The specification is vague, and experiments yield disappointing results.
For example:
String line = "xx90xx";
String pat = "\\b90\\b";
System.out.println(Pattern.compile(pat).matcher(line).find()); // prints false
for (int i = 0; i < line.length(); ++i) {
    for (int j = i + 1; j <= line.length(); ++j) {
        Matcher m = Pattern.compile(pat).matcher(line).region(i, j);
        // Accept only a match that spans the whole region [i, j).
        if (m.find() && m.group().length() == (j - i)) {
            System.out.println(m.group() + " (" + i + ", " + j + ")"); // prints 90 (2, 4)
        }
    }
}
I'm not sure what the most elegant solution is. One approach would be to take a substring of line and pad it with the appropriate boundary characters before checking whether pat matches.
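One alternative to padding that may be worth trying: java.util.regex.Matcher has useTransparentBounds(boolean) and useAnchoringBounds(boolean), which control whether boundary and lookaround constructs can see past the region edges and whether ^ and $ match at them. The sketch below assumes that switching to transparent, non-anchoring bounds gives the desired behavior. The class name RegionMatches and the helper allRegionMatches are made up for illustration, and I have only checked it against the small cases shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionMatches {
    // Hypothetical helper: records "match [start, end)" for every non-empty
    // region of text that the regex matches in full.
    static List<String> allRegionMatches(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        // Let \b and lookarounds inspect characters outside the region.
        m.useTransparentBounds(true);
        // Keep ^ and $ tied to the real ends of the input, not the region edges.
        m.useAnchoringBounds(false);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                m.region(i, j); // resets the matcher; the bounds settings above are kept
                if (m.matches()) { // the whole region must match
                    result.add(m.group() + " [" + i + ", " + j + ")");
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allRegionMatches("xx90xx", "\\b90\\b")); // prints []
        System.out.println(allRegionMatches("a 90 b", "\\b90\\b")); // prints [90 [2, 4)]
    }
}
```

With the default (opaque, anchoring) bounds, the first call would report 90 at [2, 4), which is the surprising result shown above.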
EDIT 3
Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried -- although I haven't done extensive testing. It is most certainly less efficient than it could be.
public static void allMatches(String text, String regex)
{
    for (int i = 0; i < text.length(); ++i) {
        for (int j = i + 1; j <= text.length(); ++j) {
            // Pin the match to [i, j): exactly i characters before it and text.length() - j after it.
            String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
            Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);
            if (m.find())
            {
                System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
            }
        }
    }
}
EDIT 4
Here's a better way of doing this: https://stackoverflow.com/a/11372670/244526
EDIT 5
The JRegex library supports finding all overlapping substrings matching a Java regex (although it appears not to have been updated in a while). Specifically, the documentation on non-breaking search specifies:
Using non-breaking search you can find all possible occurrences of a pattern, including those that are intersecting or nested. This is achieved by using the Matcher's method proceed() instead of find().