I know there are similar answer to this on stack, as well as online, but I feel I'm missing something. Given the code below, we need to reconstruct the sequence of events that led to the resulting minimum edit distance. For the code below, we need to write a function that outputs:
Equal, L, L
Delete, E
Equal, A, A
Substitute, D, S
Insert, T
EDIT: CODE IS UPDATED WITH MY (PARTIALLY CORRECT) SOLUTION
Here is the code, with my partial solution. It works for example I was given ("lead" -> "last"), but doesn't work for the example below ("hint" -> "isnt"). I suspect this is because the first character is equal, which is throwing off my code. Any tips or pointers in the right direction would be great!
def printMatrix(M):
for row in M:
print row
print
def med(s, t):
k = len(s) + 1
l = len(t) + 1
M = [[0 for i in range(k)] for j in range(l)]
MTrace = [["" for i in range(k)] for j in range(l)]
M[0][0] = 0
for i in xrange(0, k):
M[i][0] = i
MTrace[i][0] = s[i-1]
for j in xrange(0, l):
M[0][j] = j
MTrace[0][j] = t[j-1]
MTrace[0][0] = "DONE"
for i in xrange(1, k):
for j in xrange(1, l):
sub = 1
sub_op = "sub"
if s[i-1] == t[j-1]:
# equality
sub = 0
sub_op = "eq"
# deletion
min_value = M[i-1][j] + 1
op = "del"
if min_value > M[i][j-1] + 1:
# insertion
min_value = M[i][j-1] + 1
op = "ins"
if min_value > M[i-1][j-1] + sub:
# substitution
min_value = M[i-1][j-1] + sub
op = sub_op
M[i][j] = min_value
MTrace[i][j] = op
print "final Matrix"
printMatrix(M)
printMatrix(MTrace)
############ MY PARTIAL SOLUTION
def array_append(array,x,y):
ops_string = MTrace[x][y]
if ops_string == 'ins':
array.append(("Insert",MTrace[0][y]))
elif ops_string == 'sub':
array.append(("Substitute",MTrace[x][0],MTrace[0][y]))
elif ops_string == 'eq':
array.append(("Equal",MTrace[x][0],MTrace[0][y]))
elif ops_string == 'del':
array.append(("Delete",MTrace[x][0]))
i = len(s)
j = len(t)
ops_array = []
base = M[i][j]
array_append(ops_array,i,j)
while MTrace[i][j] != "DONE":
base = M[i][j]
local_min = min(M[i][j-1],M[i-1][j],M[i-1][j-1])
if base == local_min:
i = i - 1
j = j - 1
array_append(ops_array,i,j)
elif M[i][j-1] < M[i-1][j]:
j = j -1
array_append(ops_array,i,j)
elif M[i-1][j] < M[i][j-1]:
i = i - 1
array_append(ops_array,i,j)
else:
i = i - 1
j = j - 1
array_append(ops_array,i,j)
print ops_array
#########
return M[k-1][l-1]
print med('lead', 'last')
I don't know python, but the following C# code works if that's any help.
Some tests:
It's my opinion that understanding the algorithm more deeply is important in this case. Rather than giving you some pseudocode, I'll walk you through the essential steps of the algorithm, and show you how the data you want is "encoded" in the final matrix that results. Of course, if you don't need to roll your own algorithm, then you should obviously just use someone else's, as MattH suggests!
The Big Picture
This looks to me like an implementation of the Wagner-Fischer algorithm. The basic idea is to calculate the distances between "nearby" prefixes, take the minimum, and then calculate the distance for the current pair of strings from that. So for example, say you have two strings
'i'
and'h'
. Let's lay them out along the vertical and horizontal axes of a matrix, like so:Here,
'_'
denotes an empty string, and each cell in the matrix corresponds to an edit sequence that takes an input (''
or'i'
) to an output (''
or'h'
).The distance from the empty string to any string of length L is L, (requiring L insertions). The distance from any string of length L to the empty string is also L (requiring L deletions). That covers the values in the first row and column, which simply increment.
From there, you can calculate the value of any location by taking the minimum from among the upper, left, and upper-left values, and adding one, or, if the letter is the same at that point in the string, taking the upper-left value unchanged. For the value at
(1, 1)
in the table above, the minimum is0
at(0, 0)
, so the value at(1, 1)
is1
, and that's the minimum edit distance from'i'
to'h'
(one substitution). So in general, the minimum edit distance is always in the lower right corner of the matrix.Now let's do another, comparing
is
tohi
. Here again, each cell in the matrix corresponds to an edit sequence that takes an input (''
,'i'
, or'is'
) to an output (''
,'h'
, or'hi'
).We begin by enlarging the matrix, using
#
as a placeholder for values we don't know yet, and extending the first row and column by incrementing. Having done so, we can begin calculating results for positions marked#
above. Let's start at(2, 1)
(in (row, column), i.e. row-major notation). Among the upper, upper-left, and left values, the minimum is1
. The corresponding letters in the table are different --s
andh
-- so we add one to that minimum value to get2
, and carry on.Let's move on to the value at
(1, 2)
. Now things go a little differently because the corresponding letters in the table are the same -- they're bothi
. This means we have the option of taking the value in the upper-left cell without adding one. The guiding intuition here is that we don't have to increase the count because the same letter is being added to both strings at this position. And since the lengths of both strings have increased by one, we move diagonally.With the last empty cell, things go back to normal. The corresponding letters are
s
andi
, and so we again take the minimum value and add one, to get2
:Here's the table we get if we continue this process for two longer words that start with
is
andhi
--isnt
(ignoring punctuation) andhint
:This matrix is slightly more complex, but the final minimum edit distance here is still just
2
, because the last two letters of these two strings are the same. Convenient!Recreating the Sequence of Edits
So how can we extract the types of edits from this table? The key is to realize that movement on the table corresponds to particular types of edits. So for example, a rightward movement from
(0, 0)
to(0, 1)
takes us from_ -> _
, requiring no edits, to_ -> h
, requiring one edit, an insertion. Likewise, a downward movement from(0, 0)
to(1, 0)
takes us from_ -> _
, requiring no edits, toi -> _
, requiring one edit, a deletion. And finally, a diagonal movement from(0, 0)
to(1, 1)
takes us from_ -> _
, requiring no edits, toi -> h
, requiring one edit, a substitution.So now all we have to do is reverse our steps, tracing local minima from among the upper, left, and upper-left cells back to the origin,
(0, 0)
, keeping in mind that if the current value is the same as the minimum, then we must go to the upper-left cell, since that's the only kind of movement that doesn't increment the edit distance.Here is a detailed description of the steps you could take to do so. Starting from the lower-right corner of the completed matrix, repeat the following until you reach the upper-left corner:
Equal
). No edit was required in this case because the characters at this location are the same.Putting it Together
In the example above, there are two possible paths:
and
Reversing them, we get
and
So for the first version, our first operation is a movement to the right, i.e. an insertion. The letter inserted is
h
, since we're moving fromisnt
tohint
. (This corresponds toInsert, h
in your verbose output.) Our next operation is a diagonal movement, i.e. either a substitution, or a no-op. In this case, it's a no-op because the edit distance is the same at both locations (i.e. the letter is the same). SoEqual, i, i
. Then a downward movement, corresponding to a deletion. The letter deleted iss
, since again, we're moving fromisnt
tohint
. (In general, the letter to insert comes from the output string, while the letter to delete comes from the input string.) So that'sDelete, s
. Then two diagonal movements with no change in value:Equal, n, n
andEqual, t, t
.The result:
Performing these instructions on
isnt
:For a total edit distance of 2.
I'll leave the second minimum path as an exercise. Keep in mind that both paths are completely equivalent; they may be different, but they will result in the same minimum edit distance of 2, and so are entirely interchangeable. At any point as you work backwards through the matrix, if you see two different possible local minima, you may take either one, and the final result is guaranteed to be correct
Once you grok all this, it shouldn't be hard to code at all. The key, in cases like this, is to deeply understand the algorithm first. Once you've done that, coding it up is a cinch.
Accumulation vs. Reconstruction
As a final note, you might chose to accumulate the edits as you populate the matrix. In that case, each cell in your matrix could be a tuple:
(2, ('ins', 'eq', 'del', 'eq', 'eq'))
. You would increment the length, and append the operation corresponding to a movement from the minimal previous state. That does away with the backtracking, and so decreases the complexity of the code; but it takes up extra memory. If you do this, the final edit sequence will appear along with the final edit distance in the lower right corner of the matrix.I suggest you have a look at the python-Levenshtein module. Will probably get you a long way there:
You can process the output from edit ops to create your verbose instructions.