I am using Jaro-Winkler fuzzy matching to match names.
I am trying to determine a cut-off range for the similarity score. If the names are too different, I want to exclude them for manual review.
Anything below 0.4 seemed to be an entirely different name, while scores in the 0.4 range seemed fairly similar.
But then I came across strange exceptions: some names in that range are entirely different, while others are only one or two letters off (see the examples below).
Can someone explain why there is such wide variation in match quality within the same score range?
Estrella ANNELISE 0.42
Arienna IREANNA 0.43
Tayvia I TAYVIA 0.43
Amanda IZABEL 0.44
Hunter JOSHUA 0.44
Ryder CHARLES 0.45
Luis ELIZABETH 0.45
Sebastian JOSE 0.45
Christopher CHISTOPHE 0.46
Genayunique GENAY-UNI 0.46
Andreeaonn ADREEAONN 0.46
Chistopher CHRISTOPH 0.46
Dazharicon DAZHARION 0.46
Jennavecia JENNACVEC 0.46
Valentiria VALENTINA 0.46
Abel SAMMUEL 0.46
Dezarea MarieDEZAREA 0.47
Alexander ALEXZANDE 0.47
I found that Levenshtein distance was more useful for the specific matching problems on names.
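For reference, here is a minimal sketch of Levenshtein distance in plain Python, applied to a few pairs from the question. The function name `levenshtein` and the choice to lowercase both names before comparing are my own assumptions for illustration, not something taken from a particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the DP row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Assumption: case is normalized first, since the sample data mixes
# title case and upper case.
pairs = [("Valentiria", "VALENTINA"), ("Christopher", "CHISTOPHE"), ("Abel", "SAMMUEL")]
for left, right in pairs:
    print(left, right, levenshtein(left.lower(), right.lower()))
# Valentiria VALENTINA 2
# Christopher CHISTOPHE 2
# Abel SAMMUEL 4
```

Because the result is an edit count rather than a 0 to 1 score, a cut-off would probably need to be scaled by name length, but that is a design choice and not something the distance itself prescribes.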
The Jaro-Winkler distance formula is biased towards strings with a common beginning, for example Valentina and Valentiria.
It also has some not-so-intuitive "rules" (see Wikipedia).
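To make the prefix bias and the matching rules concrete, here is a rough sketch of Jaro and Jaro-Winkler following the formulation on Wikipedia. The assumptions are: a match window of max(len) // 2 - 1, a scaling factor p = 0.1, a prefix capped at 4 characters, and no minimum-score threshold for the Winkler boost. Real libraries differ on some of these details, so the numbers will not necessarily match any particular tool.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; characters are compared exactly (case-sensitive)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters only count as a match if they are at most
    # `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("valentiria", "valentina"), 2))          # 0.9
print(round(jaro_winkler("valentiria", "valentina"), 2))  # 0.94
print(round(jaro_winkler("Valentiria", "VALENTINA"), 2))  # 0.46
```

The first two lines show the Winkler boost from the shared prefix "vale". The third shows that, in this sketch, a case-sensitive comparison of "Valentiria" and "VALENTINA" matches only the leading "V", so it may be worth checking whether the names are case-normalized before scoring.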
You should probably first determine what kind of dissimilarity you are expecting, and then look for a suitable distance formula. For example, in writing, "angleworm" for "angelworm" is a very likely error, so the distance between the two strings ought to be low, while confusing "there" with "three" is less likely, and with "ether" even less so. With longer anagrams, the Jaro distance might be exactly the same, and even the Winkler correction might not kick in.
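As one illustration of picking a formula to fit the expected errors: if swapped adjacent letters are the typical mistake, a distance that counts such a swap as a single edit may fit better. The sketch below uses the "optimal string alignment" variant of Damerau-Levenshtein, which is my own choice for illustration and is not mentioned above; the function name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where an adjacent transposition costs 1
    (optimal string alignment variant of Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "le" <-> "el".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("angleworm", "angelworm"))  # 1
print(osa_distance("there", "three"))          # 1
print(osa_distance("there", "ether"))          # 2
```

Note that "there"/"three" also comes out at 1 here, so this distance only captures the mechanics of a swap, not how likely a particular confusion is in practice.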
As you can read in this page (emphasis mine)