I have two sequences AAAAAAAAAGAAAAGAAGAAG, AAAGAAG.
The correct answer is AAGAAG.
But my code gives AA.
There will also be times when two strings will be in this order AAAGAAG, AAAAAAAAAGAAAAGAAGAAG.
Here is my code
`def longestSubstringFinder(string1, string2):
string1=string1.strip()
string2=string2.strip()
answer = ""
len1=len(string1)
len2=len(string2)
if int(len1)>1 and int(len2)>1:
for i in range(1,len1,1):
match = ""
for j in range(len2):
if len1>len2:
if i+j<len1 and (string1[i+j]==string2[i+j]):
match=str(match)+str(string2[i+j])
print(match)
else:
if len(match)>len(answer):
answer=match
match=""
elif len2>len1:
if i+j<len2 and (string1[i+j]==string2[i+j]):
match=str(match)+str(string2[i+j])
print(match)
else:
if len(match)>len(answer):
answer=match
match=""
return(answer)`
Get all the substrings of both strings, find the intersection of both sets of substrings and then find the largest string in the intersection
def get_all_substrings(input_string):
length = len(input_string)
return [input_string[i:j+1] for i in range(length) for j in range(i,length)]
strA = 'AAAAAAAAAGAAAAGAAGAAG'
strB = 'AAAGAAG'
intersection = set(get_all_substrings(strA)).intersection(set(get_all_substrings(strB)))
print(max(intersection, key=len))
>> 'AAAGAAG'
A few weeks ago I've stumbled upon the difflib
package in Python and it's perfect for this kind of job.
Here's a solution to your problem:
import difflib
matcher = difflib.SequenceMatcher()
str1 = 'AGAGGAG'
str2 = 'AAAAAAAAAGAAAAGAAGAAG'
matcher.set_seq2(str2)
matcher.set_seq1(str1)
m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of AAAGAAG found in AAAAAAAAAGAAAAGAAGAAG: AAAGAAG
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# AAAAAAAAAGA|AAAGAAG|AAG
str1 = 'AGAG'
matcher.set_seq1(str1)
m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of AGAG found in AAAAAAAAAGAAAAGAAGAAG: AGA
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# AAAAAAAA|AGA|AAAGAAGAAG
str1 = 'XXX'
matcher.set_seq1(str1)
m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of XXX found in AAAAAAAAAGAAAAGAAGAAG:
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# ||AAAAAAAAAGAAAAGAAGAAG
The difflib documentation says:
SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence against many
sequences, use set_seq2() to set the commonly used sequence once and
call set_seq1() repeatedly, once for each of the other sequences.
And it's also very fast!
I've timed @AK47 great solution and it timed 10000 loops, best of 3: 85.2 µs per loop
My solution timed 10000 loops, best of 3: 31.6 µs per loop