How do I determine the longest similar portion of

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.

Example:

file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.

Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.

From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.

标签： perl algorithm string similarity

7条回答

Fickle 薄情

2楼-- · 2019-03-19 16:49

The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.

Here's the one I use for longest common prefix (and a ref to original URL):

use strict; use warnings;
sub longest_common_prefix {
    # longest_common_prefix( $|@ ): returns $
    # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
    # find longest common prefix of scalar list
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

my @str = map {chomp; $_} <DATA>;
print longest_common_prefix(@ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).

#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';

use strict; use warnings;

sub LCS{
    my @str = @_;
    my @pos;
    for my $i (0 .. $#str) {
        my $line = $str[$i];
        for (0 .. length($line) - 1) {
            my $char= substr($line, $_, 1);
            push @{$pos[$i]{$char}}, $_;
        }
    }
    my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;
    my %map;
    CHAR:
    for my $char (split //, $sh_str) {
        my @loop;
        for (0 .. $#pos) {
            next CHAR if ! $pos[$_]{$char};
            push @loop, $pos[$_]{$char};
        }
        my $next = NestedLoops([@loop]);
        while (my @char_map = $next->()) {
            my $key = join '-', @char_map;
            $map{$key} = $char;
        }
    }
    my @pile;
    for my $seq (keys %map) {
        push @pile, $map{$seq};
        for (1 .. 2) {
            my $dir = $_ % 2 ? 1 : -1;
            my @offset = split /-/, $seq;
            $_ += $dir for @offset;
            my $next = join '-', @offset;
            while (exists $map{$next}) {
                $pile[-1] = $dir > 0 ?
                    $pile[-1] . $map{$next} : $map{$next} . $pile[-1];
                $_ += $dir for @offset;
                $next = join '-', @offset;
            }
        }
    }
    return reduce {length($a) > length($b) ? $a : $b} @pile;
}

my @str = map {chomp; $_} <DATA>;
print LCS(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

0人赞添加讨论(0) 举报

Animai°情兽

3楼-- · 2019-03-19 16:55

My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.

In Perl, you'll have to split up the string first into characters using something like

@array = split(//, $string);

(splitting on an empty character sets each character into its own element of the array)

Then do a loop, perhaps overall:

$n =0;
@array1 = split(//, $string1);
@array2 = split(//, $string2);
@array3 = split(//, $string3);

while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
 $n++; 
}

$sameString = substr($string1, 0, $n); #n might have to be n-1

Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.

0人赞添加讨论(0) 举报

做自己的国王

4楼-- · 2019-03-19 17:02

Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):

sub common_suffix {
    my $comm = shift @_;
    while ($_ = shift @_) {
        $_ = substr($_,-length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,-length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /(\0*)$/) {
            $comm = substr($comm, -length($1));
        } else {
            return undef;
        }
    }
    return $comm;
}


sub common_prefix {
    my $comm = shift @_;
    while ($_ = shift @_) {
        $_ = substr($_,0,length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,0,length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /^(\0*)/) {
            $comm = substr($comm,0,length($1));
        } else {
            return undef;
        }
    }
    return $comm;
}

0人赞添加讨论(0) 举报

\"骚年 ilove

5楼-- · 2019-03-19 17:06

It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.

0人赞添加讨论(0) 举报

唯我独甜

6楼-- · 2019-03-19 17:08

Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and @str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.

Perl can be fast:

use strict;
use warnings;

package LCP;

sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i          = 0;
    my $first      = shift;
    my $min_length = length($first);
    foreach (@_) {
        $min_length = length($_) if length($_) < $min_length;
    }
INDEX: foreach my $ch ( split //, $first ) {
        last INDEX unless $i < $min_length;
        foreach my $string (@_) {
            last INDEX if substr($string, $i, 1) ne $ch;
        }
    }
    continue { $i++ }
    return substr $first, 0, $i;
}

# Roy's implementation
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

1;

Test suite:

#!/usr/bin/env perl

use strict;
use warnings;

Test::LCP->runtests;

package Test::LCP;

use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);

sub test_use : Test(startup => 1) {
    use_ok('LCP');
}

sub test_lcp : Test(6) {
    is( LCP::LCP(),      '',    'Without parameters' );
    is( LCP::LCP('abc'), 'abc', 'One parameter' );
    is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
    is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
        'abcd', 'Some common prefix' );
    my @str = map { chomp; $_ } <DATA>;
    is( LCP::LCP(@str),
        'file:///home/gms8994/Music/', 'Test data prefix' );
    is( LCP::LCP2(@str),
        'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
    my $t = countit( 1, sub{LCP::LCP(@str)} );
    diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
    $t = countit( 1, sub{LCP::LCP2(@str)} );
    diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}

__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

Test suite result:

1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 16746.73/s (n=17919)

That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.

0人赞添加讨论(0) 举报

对你真心纯属浪费

7楼-- · 2019-03-19 17:14

If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings. Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.

Mathematica happens to have a function for this built in: http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)

If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).

0人赞添加讨论(0) 举报

1 2 下一页

How do I determine the longest similar portion of

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间