Levenshtein distance: how to better handle words s-第2页回答

I've had some success comparing strings using the PHP levenshtein function.

However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings.

For example:

levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences

are treated as having less in common than:

levenshtein("The quick brown fox", "The quiet swine flu"); // 9 differences

I'd prefer an algorithm which saw that the first two were more similar.

How could I go about coming up with a comparison function that can identify substrings which have switched position as being distinct to edits?

One possible approach I've thought of is to put all the words in the string into alphabetical order, before the comparison. That takes the original order of the words completely out of the comparison. A downside to this, however, is that changing just the first letter of a word can create a much bigger disruption than a changing a single letter should cause.

What I'm trying to achieve is to compare two facts about people which are free text strings, and decide how likely these facts are to indicate the same fact. The facts might be the school someone attended, the name of their employer or publisher, for example. Two records may have the same school spelled differently, words in a different order, extra words, etc, so the matching has to be somewhat fuzzy if we are to make a good guess that they refer to the same school. So-far it is working very well for spelling errors (I am using a phoenetic algorithm similar to metaphone on top of this all) but very poorly if you switch the order of words around which seem common in a school: "xxx college" vs "college of xxx".

标签： php algorithm levenshtein-distance similarity edit-distance

9条回答

别忘想泡老子

2楼-- · 2019-03-07 23:03

Its easy. Just use the Damerau-Levenshtein distance on the words instead of letters.

0人赞添加讨论(0) 举报

我欲成王，谁敢阻挡

3楼-- · 2019-03-07 23:03

Take this answer and make the following change:

void match(trie t, char* w, string s, int budget){
  if (budget < 0) return;
  if (*w=='\0') print s;
  foreach (char c, subtrie t1 in t){
    /* try matching or replacing c */
    match(t1, w+1, s+c, (*w==c ? budget : budget-1));
    /* try deleting c */
    match(t1, w, s, budget-1);
  }
  /* try inserting *w */
  match(t, w+1, s + *w, budget-1);
  /* TRY SWAPPING FIRST TWO CHARACTERS */
  if (w[1]){
    swap(w[0], w[1]);
    match(t, w, s, budget-1);
    swap(w[0], w[1]);
  }
}

This is for dictionary search in a trie, but for matching to a single word, it's the same idea. You're doing branch-and-bound, and at any point, you can make any change you like, as long as you give it a cost.

0人赞添加讨论(0) 举报

太酷不给撩

4楼-- · 2019-03-07 23:05

Eliminate duplicate words between the two strings and then use Levenshtein.

0人赞添加讨论(0) 举报

上一页 1 2

Levenshtein distance: how to better handle words s

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间