Jaro-Winkler function: why do very different names get the same score?

Posted 2019-08-05 01:22

Question:

I am using Jaro-Winkler fuzzy matching to match names.

I am trying to determine a cut-off range for the similarity score. If the names are too different, I want to exclude them for manual review.

While anything below .4 seemed to be different names entirely, the .4 range seemed fairly similar.

But then I came across strange exceptions, where some names in that range are entirely different, while others are only one or two letters off (see the examples below).

Can someone explain why there is such wide variation in match quality within the same score range?

   Estrella       ANNELISE    0.42
   Arienna        IREANNA     0.43
   Tayvia         I TAYVIA    0.43
   Amanda         IZABEL      0.44
   Hunter         JOSHUA      0.44
   Ryder          CHARLES     0.45
   Luis           ELIZABETH   0.45
   Sebastian      JOSE        0.45
   Christopher    CHISTOPHE   0.46
   Genayunique    GENAY-UNI   0.46
   Andreeaonn     ADREEAONN   0.46
   Chistopher     CHRISTOPH   0.46
   Dazharicon     DAZHARION   0.46
   Jennavecia     JENNACVEC   0.46
   Valentiria     VALENTINA   0.46
   Abel           SAMMUEL     0.46
   Dezarea Marie  DEZAREA     0.47
   Alexander      ALEXZANDE   0.47

Answer 1:

The Jaro-Winkler distance formula is biased towards strings that share a common beginning, for example Valentina and Valentiria.

It also has some not-so-intuitive "rules" (see Wikipedia).

You should probably first determine what kind of dissimilarity you are expecting, and then look for a suitable distance formula. For example, in writing, typing "angleworm" for "angelworm" is a very likely error, so the distance between the two strings ought to be low, while mistaking "there" for "three" is less likely, and "ether" even less so. With longer anagrams, the Jaro distance might be exactly the same, and even the Winkler correction might not kick in.
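To make the prefix bias concrete, here is a minimal sketch of the textbook Jaro and Jaro-Winkler formulas. It assumes the usual defaults (scaling factor p = 0.1, prefix capped at 4 characters) and returns a similarity, where higher means more alike. The function names are illustrative, and real libraries may differ in details such as the match window, so do not expect it to reproduce every score from another implementation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters only count as a match if they are
    # at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count
    # the positions where they disagree (half-transpositions).
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("valentiria", "valentina"), 2))          # plain Jaro
print(round(jaro_winkler("valentiria", "valentina"), 2))  # with prefix boost
print(round(jaro_winkler("Valentiria", "VALENTINA"), 2))  # mixed case
```

One thing worth checking against the table in the question: this sketch compares characters case-sensitively. With the pair exactly as written ("Valentiria" vs "VALENTINA") only the leading "V" matches and the score comes out at about 0.46, while lowercasing both strings first gives about 0.94. If your library behaves the same way, normalizing case before scoring may matter more than the choice of cut-off.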

As you can read on this page (emphasis mine):

Beyond the optimization for empty strings and those which are exactly the same, you can see here that I weight the first character even more heavily. This is due to my data being very initial heavy.

To compensate for the frequent use of middle initials I count Jaro-Winkler distance as 80% of the score, while the remaining 20% is fully based on the first character matching. The value of p here was determined by the results of heavy experimentation and hair pulling. Before making this extension initials would frequently align incorrectly.
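The weighting described in that quote could be sketched roughly as follows. This is not the quoted author's code: the function name, the parameter names, and the handling of empty and identical strings are assumptions made for illustration, and the Jaro-Winkler score is passed in so that any library can supply it.

```python
def first_char_weighted(jw_score: float, a: str, b: str, p: float = 0.2) -> float:
    """Blend a Jaro-Winkler score with a first-character check.

    jw_score is the Jaro-Winkler similarity of a and b, computed elsewhere.
    p is the share of the final score that depends only on whether the
    first characters match (0.2 gives the 80% / 20% split from the quote).
    """
    # Shortcuts for the trivial cases (assumed behaviour).
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    first_match = 1.0 if a[0] == b[0] else 0.0
    return (1 - p) * jw_score + p * first_match


# Same Jaro-Winkler score, different first letters:
print(first_char_weighted(0.5, "anna", "anne"))   # first letters agree
print(first_char_weighted(0.5, "anna", "hanna"))  # first letters differ
```

With the same underlying score, the pair that agrees on its first letter ends up 0.2 higher, which is the effect the quoted author wanted for data full of initials.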



Answer 2:

I found that Levenshtein distance was more useful for the specific problem of matching names.
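For reference, a minimal sketch of Levenshtein distance using the standard dynamic-programming approach. The function name is illustrative, and lowercasing the inputs first is an assumption about how you would want to compare names.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("valentiria", "valentina"))  # 2
print(levenshtein("hunter", "joshua"))         # 6
```

Unlike a Jaro-Winkler score, the result is an edit count, so a near-miss such as Valentiria/Valentina (2 edits) is clearly separated from an unrelated pair such as Hunter/Joshua (6 edits). For a cut-off you would probably want to scale it by the length of the longer name.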