All overlapping substrings matching a Java regex

Posted 2020-02-12 08:56

Question:

Is there an API method that returns all (possibly overlapping) substrings that match a regular expression?

For example, I have a text string: String t = "04/31 412-555-1235";, and I have a pattern: Pattern p = Pattern.compile("\\d\\d+"); that matches strings of two or more digits.

The matches I get are: 04, 31, 412, 555, 1235.

How do I get overlapping matches?

I want the code to return: 04, 31, 41, 412, 12, 55, 555, 55, 12, 123, 1235, 23, 235, 35.

Theoretically it should be possible -- there is an obvious O(n^2) algorithm that enumerates and checks all the substrings against the pattern.
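A minimal sketch of that brute-force enumeration, for illustration only. The class and method names (BruteForceOverlaps, allMatches) are made up for this example. It checks each extracted substring in isolation, so patterns that rely on boundary checks or lookarounds may behave differently than they would inside the full text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class BruteForceOverlaps {
    // Hypothetical helper: returns every substring of text that the regex matches in full.
    static List<String> allMatches(String text, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                String candidate = text.substring(i, j);
                // matches() requires the whole candidate to match, not just a part of it.
                if (p.matcher(candidate).matches()) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("04/31 412-555-1235", "\\d\\d+"));
    }
}
```

For the example string this enumerates O(n^2) substrings and runs one match per substring, so it is only practical for short inputs.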

EDIT

Rather than enumerating all substrings, it is safer to use the region(int start, int end) method in Matcher. Checking the pattern against a separate, extracted substring might change the result of the match (e.g. if there is a non-capturing group or word boundary check at the start/end of the pattern).

EDIT 2

Actually, it's unclear whether region() does what you expect for zero-width matches. The specification is vague, and experiments yield disappointing results.

For example:

String line = "xx90xx";
String pat = "\\b90\\b";
System.out.println(Pattern.compile(pat).matcher(line).find()); // prints false
for (int i = 0; i < line.length(); ++i) {
  for (int j = i + 1; j <= line.length(); ++j) {
    Matcher m = Pattern.compile(pat).matcher(line).region(i, j);
    if (m.find() && m.group().length() == (j - i)) {
      System.out.println(m.group() + " (" + i + ", " + j + ")"); // prints 90 (2, 4)
    }
  }
}

I'm not sure what the most elegant solution is. One approach would be to take a substring of line and pad it with the appropriate boundary characters before checking whether pat matches.
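An alternative to padding that may be worth trying (this is a different technique, not something tested in the question): Matcher has useTransparentBounds(true), which lets lookarounds and boundary checks see text outside the region, and useAnchoringBounds(false), which stops ^ and $ from matching at the region edges. The sketch below assumes those two settings give the desired behaviour. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionOverlaps {
    // Hypothetical helper: tries every region [i, j) of text against the regex.
    static List<String> allMatches(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        // Let \b, lookahead and lookbehind look past the region boundaries.
        m.useTransparentBounds(true);
        // Do not treat the region boundaries as start/end of input for ^ and $.
        m.useAnchoringBounds(false);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                m.region(i, j); // resets the matcher but keeps the bounds settings
                if (m.matches()) { // the whole region must match
                    out.add(m.group());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("xx90xx", "\\b90\\b"));
        System.out.println(allMatches("a 90 b", "\\b90\\b"));
    }
}
```

With these settings, the "xx90xx" example should no longer report a match for \b90\b, because the word-boundary check sees the surrounding x characters.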

EDIT 3

Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried -- although I haven't done extensive testing. It is most certainly less efficient than it could be.

  public static void allMatches(String text, String regex)
  {
    for (int i = 0; i < text.length(); ++i) {
      for (int j = i + 1; j <= text.length(); ++j) {
        String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
        Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);

        if (m.find()) 
        {   
          System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
        }   
      }   
    }   
  }

EDIT 4

Here's a better way of doing this: https://stackoverflow.com/a/11372670/244526

EDIT 5

The JRegex library supports finding all overlapping substrings matching a Java regex (although it appears not to have been updated in a while). Specifically, the documentation on non-breaking search specifies:

Using non-breaking search you can find all possible occurrences of a pattern, including those that are intersecting or nested. This is achieved by using the Matcher's method proceed() instead of find().

Answer 1:

I faced a similar situation and tried the answers above, but in my case setting the start and end index of the matcher took too much time. I think I've found a better solution, so I'm posting it here for others. Below is my code snippet.

if (textToParse != null) {
    Matcher matcher = PLACEHOLDER_PATTERN.matcher(textToParse);
    // Keep searching until the matcher reports that it hit the end of the input.
    while (!matcher.hitEnd()) {
        boolean result = matcher.find();
        int count = matcher.groupCount();
        System.out.println("Result " + result + " count " + count);
        if (result && count == 1) {
            mergeFieldName = matcher.group(1);
            mergeFieldNames.add(mergeFieldName);
        }
    }
}

I have used the matcher.hitEnd() method to check whether I have reached the end of the text.

Hope this helps. Thanks!



Answer 2:

This is doable in O(n) only if you bound the allowed length of the numbers.

Let's say from 2-4 digits (numbers 00-9999): (?=(\\d{2}))(?=(\\1\\d)?)(?=(\\2\\d)?)

This is a zero-length assertion via positive lookahead, with each lookahead captured into a group. The result is an array of all 2-4 digit strings that can be found in the input, together with duplicates and empty strings (for captures that did not match).

I am not a Java developer, but I believe a Perl script can be read as an example as well.

#!/usr/bin/perl                                       # perl script
use List::MoreUtils qw/ uniq /;                       # uniq subroutine library
$_ = '04/31 412-555-1235';                            # input
my @n = uniq (/(?=(\d{2}))(?=(\1\d)?)(?=(\2\d)?)/g);  # regex (single backslashes in Perl)
print "$_\n" for grep(/\S/, @n);                      # print non-empty lines

The trick is using backreferences. If you would like to capture 2-5 digit strings, you would need one more positive lookahead in the regex: (?=(\\d{2}))(?=(\\1\\d)?)(?=(\\2\\d)?)(?=(\\3\\d)?).

I believe this is the closest approach you can get. If this works for you, drop a comment and hopefully some Java developer will edit my answer with Java code for the above script.
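A possible Java rendering of the Perl script, offered as a sketch rather than a tested port. The class and method names are invented for the example. Unlike the Perl version it does not call uniq, so duplicates are kept, and it skips groups that did not participate in the match (Java returns null for those rather than an empty string).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadOverlaps {
    // Same regex as above: 2-digit capture, then optional 3- and 4-digit extensions.
    private static final Pattern RUNS =
            Pattern.compile("(?=(\\d{2}))(?=(\\1\\d)?)(?=(\\2\\d)?)");

    // Hypothetical helper: collects every 2-4 digit substring, duplicates included.
    static List<String> digitRuns(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = RUNS.matcher(text);
        // Each match is zero-width; find() moves one position forward after an empty match.
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String s = m.group(g);
                if (s != null) { // null means this optional group did not match here
                    out.add(s);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(digitRuns("04/31 412-555-1235"));
    }
}
```

This makes a single pass over the input, but the maximum length is fixed by the number of lookaheads in the pattern.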



Answer 3:

The closest you can get is something like this.

"(?=((\\d*)\\d))(?=(\\d)\\d*)"

The results will be in capturing groups 1, 2 and 3.

As far as my imagination can go, I can only think of capturing inside a zero-length assertion as a viable way to recapture the same position of a string. Capturing text outside the zero-length assertion consumes the text once and for all (look-behind must have a bounded length in Java, so it can be considered inaccessible for this purpose).

This solution is not perfect: aside from repetition (of text at same position!) and empty string matches, it won't capture all possible substrings.

One way to capture all possible substrings is to construct the following regex with the value of n starting from 1:

"(?=(\\d{" + n + "}))"

Then match the string against this for increasing values of n until there is no match.
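A short sketch of that loop, with hypothetical class and method names. It rebuilds the lookahead pattern for each n and stops at the first n that produces no match at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedLengthOverlaps {
    // Hypothetical helper: collects digit substrings of length 1, then 2, then 3, ...
    static List<String> allDigitRuns(String text) {
        List<String> out = new ArrayList<>();
        for (int n = 1; ; n++) {
            Matcher m = Pattern.compile("(?=(\\d{" + n + "}))").matcher(text);
            boolean found = false;
            while (m.find()) {
                found = true;
                out.add(m.group(1)); // the lookahead captures exactly n digits
            }
            if (!found) {
                break; // no run of n digits exists, so no longer run can exist either
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allDigitRuns("412-55"));
    }
}
```

Results come out grouped by length rather than by position, and the input is scanned once per length.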

This method is, of course, inefficient compared to matching all numbers with "\\d+" and extracting all substrings of each match.
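The more efficient alternative mentioned above might look like the following sketch. The names and the minLen parameter are assumptions added for illustration. The minimum length is included so the result can be restricted to two or more digits, as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunSubstrings {
    // Hypothetical helper: finds maximal digit runs, then lists their substrings
    // of at least minLen characters.
    static List<String> substringsOfRuns(String text, int minLen) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\d+").matcher(text);
        while (m.find()) {
            String run = m.group();
            for (int i = 0; i < run.length(); i++) {
                for (int j = i + minLen; j <= run.length(); j++) {
                    out.add(run.substring(i, j));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(substringsOfRuns("04/31 412-555-1235", 2));
    }
}
```

This only works because every substring of a digit run is itself a digit run. It does not generalize to arbitrary patterns.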