Find the longest substring in two genomic sequence

2019-08-21 11:23发布

问题:

I have two sequences AAAAAAAAAGAAAAGAAGAAG, AAAGAAG. The correct answer is AAGAAG.

But my code gives AA.

There will also be times when two strings will be in this order AAAGAAG, AAAAAAAAAGAAAAGAAGAAG.

Here is my code

`def longestSubstringFinder(string1, string2):
    string1=string1.strip()
    string2=string2.strip()
    answer = ""
    len1=len(string1)
    len2=len(string2)
    if int(len1)>1 and int(len2)>1:
        for i in range(1,len1,1):
            match = ""
            for j in range(len2):
                if len1>len2:
                    if i+j<len1 and (string1[i+j]==string2[i+j]):
                        match=str(match)+str(string2[i+j])
                        print(match)
                    else:
                        if len(match)>len(answer):
                            answer=match
                            match=""
                elif len2>len1:
                    if i+j<len2 and (string1[i+j]==string2[i+j]):
                        match=str(match)+str(string2[i+j])
                        print(match)
                    else:
                        if len(match)>len(answer):
                            answer=match
                            match=""
    return(answer)`

回答1:

Get all the substrings of both strings, find the intersection of both sets of substrings and then find the largest string in the intersection

def get_all_substrings(input_string):
  length = len(input_string)
  return [input_string[i:j+1] for i in range(length) for j in range(i,length)]

strA = 'AAAAAAAAAGAAAAGAAGAAG'
strB = 'AAAGAAG'

intersection = set(get_all_substrings(strA)).intersection(set(get_all_substrings(strB)))
print(max(intersection, key=len))
>> 'AAAGAAG'


回答2:

A few weeks ago I've stumbled upon the difflib package in Python and it's perfect for this kind of job.

Here's a solution to your problem:

import difflib
matcher = difflib.SequenceMatcher()

str1 = 'AGAGGAG'
str2 = 'AAAAAAAAAGAAAAGAAGAAG'
matcher.set_seq2(str2)
matcher.set_seq1(str1)

m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of AAAGAAG found in AAAAAAAAAGAAAAGAAGAAG: AAAGAAG
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# AAAAAAAAAGA|AAAGAAG|AAG

str1 = 'AGAG'

matcher.set_seq1(str1)

m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of AGAG found in AAAAAAAAAGAAAAGAAGAAG: AGA
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# AAAAAAAA|AGA|AAAGAAGAAG

str1 = 'XXX'

matcher.set_seq1(str1)

m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of XXX found in AAAAAAAAAGAAAAGAAGAAG: 
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# ||AAAAAAAAAGAAAAGAAGAAG

The difflib documentation says:

SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.

And it's also very fast!

I've timed @AK47 great solution and it timed 10000 loops, best of 3: 85.2 µs per loop

My solution timed 10000 loops, best of 3: 31.6 µs per loop