I've designed an algorithm to find the longest common subsequence. these are steps:
Pick the first letter in the first string.
Look for it in the second string and if its found, Add that letter to
common_subsequence
and store its position inindex
, Otherwise compare the length ofcommon_subsequence
with the length oflcs
and if its greater, asign its value tolcs
.Return to the first string and pick the next letter and repeat the previous step again, But this time start searching from
index
th letterRepeat this process until there is no letter in the first string to pick. At the end the value of
lcs
is the Longest Common Subsequence.
This is an example:
X=A, B, C, B, D, A, B
Y=B, D, C, A, B, A
Pick A
in the first string.
Look for A
in Y
.
Now that there is an A
in the second string, append it to common_subsequence
.
Return to the first string and pick the next letter that is B
.
Look for B
in the second string this time starting from the position of A
.
There is a B
after A
so append B to common_subsequence
.
Now pick the next letter in the first string that is C
. There isn't a C
next to B
in the second string. So assign the value of common_subsequence to lcs
because its length is greater than the length of lcs
.
repeat the previous steps until reaching the end of the first string. In the end the value of lcs
is the Longest Common Subsequence.
The complexity of this algorithm is theta(n*m). Here is my implementations:
First algorithm:
import time
def lcs(xstr, ystr):
if not (xstr and ystr): return # if string is empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
for i in xrange(len(xstr)):
cs = '' # common subsequence
start = 0 # start position in ystr
for item in xstr[i:]:
index = ystr.find(item, start) # position at the common letter
if index != -1: # if common letter has found
cs += item # add common letter to the cs
start = index + 1
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
The same algorithm using hash table:
import time
from collections import defaultdict
def lcs(xstr, ystr):
if not (xstr and ystr): return # if strings are empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
location = defaultdict(list) # keeps track of items in the ystr
i = 0
for k in ystr:
location[k].append(i)
i += 1
for i in xrange(len(xstr)):
cs = '' # common subsequence
index = -1
reached_index = defaultdict(int)
for item in xstr[i:]:
for new_index in location[item][reached_index[item]:]:
reached_index[item] += 1
if index < new_index:
cs += item # add item to the cs
index = new_index
break
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed