How can I merge overlapping strings in python?

2019-02-20 11:36发布

I have some strings,

['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

These strings partially overlap each other. If you manually overlapped them you would get:

SGALWDVPSPV

I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.

Does anyone have any advice on a nice way to do this in python? Thanks!

5条回答
Emotional °昔
2楼-- · 2019-02-20 11:53

Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:

from collections import defaultdict
from random import choice, shuffle

def overlap(a, b):
    """ get the maximum overlap of a & b plus where the overlap starts """

    overlaps = []

    for i in range(len(b)):
        for j in range(len(a)):
            if a.endswith(b[:i + 1], j):
                overlaps.append((i, j))

    return max(overlaps) if overlaps else (0, -1)

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']

shuffle(lst)  # to verify order doesn't matter

overlaps = defaultdict(list)

while len(lst) > 1:
    overlaps.clear()

    for a in lst:
        for b in lst:
            if a == b:
                continue

            amount, start = overlap(a, b)
            overlaps[amount].append((start, a, b))

    maximum = max(overlaps)

    if maximum == 0:
        break

    start, a, b = choice(overlaps[maximum])  # pick one among equals

    lst.remove(a)
    lst.remove(b)
    lst.append(a[:start] + b)

print(*lst)

OUTPUT

% python3 test.py
NONSEQUITUR SGALWDVPSPV
%

Computes all the overlaps and combines the largest overlap into a single element, replacing the original two, and starts process over again until we're down to a single element or no overlaps.

The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.

查看更多
混吃等死
3楼-- · 2019-02-20 11:57

Here is a quick sorting solution:

s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b

Output:

'SGALWDVPSPV'

This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.

查看更多
smile是对你的礼貌
4楼-- · 2019-02-20 11:57

To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

def overlap(a, b):
    return max(i for i in range(len(b)+1) if a.endswith(b[:i]))

res = lst[0]
for s in lst[1:]:
    o = overlap(res, s)
    res += s[o:]
print(res) # SGALWDVPSPV

Or using reduce:

from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))

This is probably not super-efficient, with complexity of about O(n k), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by only testing whether the last char of the presumed overlap of b is the last character of a, thus reducing the amount of string slicing and function calls in the generator expression:

def overlap(a, b):
    return max(i for i in range(len(b)) if b[i-1] == a[-1] and a.endswith(b[:i]))
查看更多
小情绪 Triste *
5楼-- · 2019-02-20 12:01

Once the peptides start to grow to 20 aminoacids cdlane's code chokes and spams (multiple) incorrect answer(s) with various amino acid lengths.

Try to add and use AA sequence 'VPSGALWDVPS' with or without 'D' and the code starts to fail its task because the N-and C-terminus grow and do not reflect what Adam Price is asking for. The output is: 'SGALWDVPSGALWDVPSPV' and thus 100% incorrect despite the effort.

Tbh imo there is only one 100% answer and that is to use BLAST and its protein search page or BLAST in the BioPython package. Or adapt cdlane's code to reflect AA gaps, substitutions and AA additions.

查看更多
smile是对你的礼貌
6楼-- · 2019-02-20 12:03

My proposed solution with a more challenging test list:

#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']

for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0,len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]

print(strFrag)       
print(strFrag[0])

Basically the code compares every string/fragment to the first in list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. Code assumes that there are no unfillable gaps between strings/fragments (Otherwise answer may not result in longest possible assembly. Can be solved by randomizing the starting string/fragment). Also assumes that the reverse complement is not present (poor assumption with contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirements (changing bestMatch[0] value) to prevent false matches. Last assumption is that all matches are exact. To enable flexibility in permitting mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.

查看更多
登录 后发表回答