Difference between Jaro-Winkler and Levenshtein di

2019-01-29 20:18发布

问题:

I have a use case where I need to do fuzzy matching of millions of records from multiple files. I identified two algorithms for that: Jaro-Winkler and Levenshtein edit distance.

When I started exploring both, I was not able to understand what the exact difference is between the two. It seems Levenshtein gives the number of edits between two strings, and Jaro-Winkler gives a matching score between 0.0 to 1.0. I didn't understand the algorithm. As I need to use either algorithm, I need to know the exact differences with respect to algorithm performance.

回答1:

Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value by the formula

1 - (edit distance / length of the larger of the two strings)

The Jaro algorithm is a measure of characters in common, being no more than half the length of the longer string in distance, with consideration for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. Jaro and Jaro-Winkler are suited for comparing smaller strings like words and names.

Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.

There is a wealth of detailed information on these algorithms and other fuzzy string matching algorithms on the internet. This one will give you a start:

A Comparison of Personal Name Matching: Techniques and Practical Issues

According to that paper, the speed of the four Jaro and Levenshtein algorithms I've mentioned are from fastest to slowest:

  • Jaro
  • Jaro-Winkler
  • Levenshtein
  • Damerau-Levenshtein

with the slowest taking 2 to 3 times as long as the fastest. Of course these times are dependent on the lengths of the strings and the implementations, and there are ways to optimize these algorithms that may not have been used.