Edit distance algorithm explanation

2019-07-13 06:33发布

According to wikipedia, the definition of the recursive formula which calculates the Levenshtein distance between two strings a and b is the following:

enter image description here

I don't understand why we don't take into consideration the cases in which we delete a[j], or we insert b[i]. Also, correct me if I am wrong, isn't the case of insertion the same as the case of the deletion? I mean, instead of deleting a character from one string, we could insert the same character in the second string, and the opposite. So why not merge the insert/delete operations into one operation with cost equal to min{cost_insert, cost_delete}?

2条回答
走好不送
2楼-- · 2019-07-13 06:51

If the costs of insertions and deletions differ, inserting in one string isn't the same as deleting from the other. Even the cost of a substitution could differ from the cost of insertion+deletion and it is wise to keep them separate.

The problem is usually asymmetric: you have a list of valid strings, and you want to match to another that has errors in it.

查看更多
乱世女痞
3楼-- · 2019-07-13 07:14

This is not done, because you are not allowed to edit both strings. The definition of the edit distance (from wikipedia) is:

the minimum-weight series of edit operations that transforms a into b.

So you are specifically looking for (the weight of) a sequence of operations to execute on the string a to transform it into string b.

Also, the edit distance is not necessarily symmetric. If your costs for inserts and deletions are identical the distance is symmetric: d(a,b) = d(b,a)

Consider the wikipedia example but with different costs:

  • costs for insertions: w_ins = 1
  • costs for deletions: w_del = 2
  • costs for substitutions: w_sub = 1

The distance of kitten and sitting still is 3,

kitten -> sitten  (substitution k->s, cost 1)
sitten -> sittin  (substitution e->i, cost 1)
sittin -> sitting (insertion of g, cost 1)
=> d(kitten, sitting) = 3

but the distance of sitting and kitten is not:

sitting -> kitting  (substitution s->k, cost 1)
kitting -> kitteng  (substitution i->e, cost 1)
kitteng -> kitten (deletion of g, cost 2)
=> d(kitten, sitting) = 4

You see that d(kitten, sitting) != d(kitten, sitting).

On the other hand if you do use symmetric costs, as the Levenshtein distance (which is an edit distance with unit costs) does, you can assume that d(a,b) = d(b,a) holds. Then you do not win anything by also considering the inverse cases. What you lose is the information which character has been replaced in which string, what makes it harder to extract the sequence of operations afterwards.

The Wagner-Fisher algorithm which you are showing in your question can extract this from the DP matrix by backtracking the path with minimal costs. Consider this two edit matrices between to and foo with unit costs:

  t o       f o o
f 1 2     t 1 2 3
o 2 1     o 2 1 2
o 3 2

Note that if you transpose the matrix for d(to, foo) you get the matrix for d(foo, to). Note that by this, an insertion in the first matrix becomes a deletion in the second matrix and vice versa. So this is where this symmetry you are looking for is coming up again.

I hope this helps :)

查看更多
登录 后发表回答