Regex to match the longest repeating substring

2019-01-18 03:37发布

站内文章 / 移动开发

58 0

再贱就再见

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.

What's more, I need to find the longest pattern possible, and it's length is at least 1. I have written the expression to check for it ^.*?(.+)(\1).*?$. To find longest pattern, I've used non-greedy version to match something before patter, then pattern is matched to group 1 and once again same thing that has been matched for group1 is matched. Then the rest of string is matched, producing equal string. But there's a problem that regex is eager to return after finding first pattern, and don't really take into account that I intend to make those substrings before and after shortest possible (leaving the rest longest possible). So from string 01011010 I get correctly that there's match, but the pattern stored in group 1 is just 01 though I'd except 101.

As I believe I can't make pattern "more greedy" or trash before and after even "more non-greedy" I can only come whit an idea to make regex less eager, but I'm not sure if this is possible.

Further examples:

56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match

What I would get using current expression (same strings used):

no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match

回答1:

In Perl you can do it with one expression with help of (??{ code }):

$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;

Output:

What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.

To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.

The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).

Thus the full expression would have the form:

(?=(.+)\1)(?!.+?(..{N,})\2}))

With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).

Usage example:

use v5.10;

sub longest_rep{
    $_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}

say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';

Output:

回答2:

You can do it in a single regex, you just have to pick the longest match from the list of results manually.

def longestrepeating(strg):
    regex = re.compile(r"(?=(.+)\1)")
    matches = regex.findall(strg)
    if matches:
        return max(matches, key=len)

This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):

>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']

回答3:

Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.

It's pretty rudimentary code, but hopefully you'll get the gist of it.

use v5.10;
use strict;
use warnings;

while (<DATA>) {
    chomp;
    print "$_ : ";
    my $longest = foo($_);
    if ($longest) {
        say $longest;
    } else {
        say "No matches found";
    }
}

sub foo {
    my $num = shift;
    my @hits;
    for my $i (0 .. length($num)) {
        my $part = substr $num, $i;
        push @hits, $part =~ /(.+)(?=\1)/g;
    }
    my $long = shift @hits;
    for (@hits) {
        if (length($long) < length) {
            $long = $_;
        }
    }
    return $long;
}

__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919

回答4:

Not sure if anyone's thought of this...

my $originalstring="pdxabababqababqh1234112341";

my $max=int(length($originalstring)/2);
my @result;
foreach my $n (reverse(1..$max)) {
    @result=$originalstring=~m/(.{$n})\1/g;
    last if @result;
}

print join(",",@result),"\n";

The longest doubled match cannot exceed half the length of the original string, so we count down from there.

If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.

my $n;
foreach (1..$max) {
    unless (@result=$originalstring=~m/(.{$_,})\1/g) {
        $n=--$_;
        last;
    }
}
@result=$originalstring=~m/(.{$n})\1/g;

print join(",",@result),"\n";

回答5:

Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.

One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:

my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
  if($s =~ m/(.{$i})\1/)
  {
    $one = $1;
    last;
  }
}
# now $one is '101'