Modifying Levenshtein Distance for positional Bias

2019-02-25 18:49发布

问题:

I am using the Levenshtein distance algorithm to compare a company name provided as a user input against a database of known company names to find closest match. By itself, the algorithm works okay, but I want to build in a Bias so that the edit distance is considered lower if the initial parts of the strings match.

For Example, if the search criteria is "ABCD", then both "ABCD Co." and "XYX ABCD" have identical Edit Distance. However I want to add weight to the fact that the initial parts of the first string matches the search criteria more closely than the second string.

One way of doing this might be to modify the insert/delete/replace costs to be higher at the beginning of the strings and lower towards the end. Does anyone have an example of a successful implementation of this? Is using Levenshtein distance still the best way to do what I am trying to achieve? Is my assumption of the approach accurate?

UPDATE: For my immediate purposes I have decided to forgo the above and instead use the Jaro Winkler edit distance which seems to solve the problem. However I will leave this open for further inputs.

回答1:

What you're looking for looks like a Smith-Waterman local alignment: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm