I've designed an algorithm to find the longest common subsequence. these are steps:
Pick the first letter in the first string.
Look for it in the second string and if its found, Add that letter to
common_subsequence
and store its position inindex
, Otherwise compare the length ofcommon_subsequence
with the length oflcs
and if its greater, asign its value tolcs
.Return to the first string and pick the next letter and repeat the previous step again, But this time start searching from
index
th letterRepeat this process until there is no letter in the first string to pick. At the end the value of
lcs
is the Longest Common Subsequence.
This is an example:
X=A, B, C, B, D, A, B
Y=B, D, C, A, B, A
Pick A
in the first string.
Look for A
in Y
.
Now that there is an A
in the second string, append it to common_subsequence
.
Return to the first string and pick the next letter that is B
.
Look for B
in the second string this time starting from the position of A
.
There is a B
after A
so append B to common_subsequence
.
Now pick the next letter in the first string that is C
. There isn't a C
next to B
in the second string. So assign the value of common_subsequence to lcs
because its length is greater than the length of lcs
.
repeat the previous steps until reaching the end of the first string. In the end the value of lcs
is the Longest Common Subsequence.
The complexity of this algorithm is theta(n*m). Here is my implementations:
First algorithm:
import time
def lcs(xstr, ystr):
if not (xstr and ystr): return # if string is empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
for i in xrange(len(xstr)):
cs = '' # common subsequence
start = 0 # start position in ystr
for item in xstr[i:]:
index = ystr.find(item, start) # position at the common letter
if index != -1: # if common letter has found
cs += item # add common letter to the cs
start = index + 1
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
The same algorithm using hash table:
import time
from collections import defaultdict
def lcs(xstr, ystr):
if not (xstr and ystr): return # if strings are empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
location = defaultdict(list) # keeps track of items in the ystr
i = 0
for k in ystr:
location[k].append(i)
i += 1
for i in xrange(len(xstr)):
cs = '' # common subsequence
index = -1
reached_index = defaultdict(int)
for item in xstr[i:]:
for new_index in location[item][reached_index[item]:]:
reached_index[item] += 1
if index < new_index:
cs += item # add item to the cs
index = new_index
break
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
If your professor wants you to invent your own LCS algorithm, you're done. Your algorithm is not the most optimal one ever created, but it's in the right complexity class, you clearly understand it, and you clearly didn't copy your implementation from the internet. You might want to be prepared to defend your algorithm, or discuss alternatives. If I were your prof, I'd give you an A if:
On the other hand, if your professor wants you to pick one of the well-known algorithms and write your own implementation, you probably want to use the standard LP algorithm. It's a standard algorithm for a reason—which you probably want to read up on until you understand. (Even if it isn't going to be on the test, you're taking this class to learn, not just to impress the prof, right?)
Wikipedia has pseudocode for a basic implementation, then English-language descriptions of common optimizations. I'm pretty sure that writing your own Python code based on what's on that page wouldn't count as plagiarism, or even as a trivial port, especially if you can demonstrate that you understand what your code is doing, and why, and why it's a good algorithm. Plus, you're writing it in Python, which has much better ways to memoize than what's demonstrated in that article, so if you understand how it works, your code should actually be substantially better than what Wikipedia gives you.
Either way, as I suggested in the comments, I'd read A survey of longest common subsequence algorithms by Bergroth, Hakonen, and Raita, and search for similar papers online.