I'm using the following class to calculate the Jaro-Winkler distance between two strings. What I'm noticing is that the distance calculated between string A and string B is not always the same as the distance between string B and string A. Is this to be expected?
RAMADI ~ TRADING
0.73492063492063
TRADING ~ RAMADI
0.71825396825397
It turns out there is a bug in the PHP versions of the Jaro-Winkler string comparison method that can be found in many places online.
Currently, comparing string A to string B can yield a different result from comparing string B to string A. This happens when a character that appears in both strings appears more than once in one of them. This is incorrect: the Jaro-Winkler method should yield the same match value for A compared to B as for B compared to A.
To rectify this, the same character should not be repeated when identifying the common characters. The common characters variable needs to be deduplicated before it is returned.
The code below replaces the common characters string with an array that uses each common character as the key, which avoids duplication. With this change, A compared to B yields the same result as B compared to A.
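A minimal sketch of that idea follows. It is not the original class: the function names `commonCharacters`, `jaro` and `jaroWinkler`, the match window of `max(len) / 2 - 1`, and the prefix scale of `0.1` capped at four characters are all assumptions made for illustration. The point it shows is that keying the array by character means a repeated letter is stored only once, so both directions end up with the same set of common characters.

```php
<?php
// Hypothetical helper: collect the characters of $a that also occur in $b
// within $window positions. The array is keyed by character, so a letter
// that matches several times is recorded only once.
function commonCharacters(string $a, string $b, int $window): array
{
    $common = [];
    $lenA = strlen($a);
    $lenB = strlen($b);
    for ($i = 0; $i < $lenA; $i++) {
        $start = max(0, $i - $window);
        $end = min($i + $window + 1, $lenB);
        for ($j = $start; $j < $end; $j++) {
            if ($a[$i] === $b[$j]) {
                $common[$a[$i]] = true; // a repeat overwrites the same key
                break;
            }
        }
    }
    // Digit keys are converted to integers by PHP, so cast back to strings.
    return array_map('strval', array_keys($common));
}

// Jaro similarity built on the deduplicated common characters.
function jaro(string $a, string $b): float
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    if ($lenA === 0 || $lenB === 0) {
        return 0.0;
    }
    // Assumed window size; other implementations may use a different one.
    $window = max(0, intdiv(max($lenA, $lenB), 2) - 1);

    $c1 = commonCharacters($a, $b, $window);
    $c2 = commonCharacters($b, $a, $window);
    $m = count($c1); // both directions hold the same set of characters
    if ($m === 0) {
        return 0.0;
    }

    // Count positions where the two lists disagree, then halve.
    $t = 0;
    for ($k = 0; $k < $m; $k++) {
        if ($c1[$k] !== $c2[$k]) {
            $t++;
        }
    }
    $t = $t / 2;

    return ($m / $lenA + $m / $lenB + ($m - $t) / $m) / 3.0;
}

// Winkler adjustment: boost the score for a shared prefix (up to 4 chars).
function jaroWinkler(string $a, string $b, float $scale = 0.1): float
{
    $score = jaro($a, $b);
    $maxPrefix = min(4, strlen($a), strlen($b));
    $prefix = 0;
    while ($prefix < $maxPrefix && $a[$prefix] === $b[$prefix]) {
        $prefix++;
    }
    return $score + $prefix * $scale * (1.0 - $score);
}

// Usage: both orderings should now give the same value.
var_dump(jaroWinkler('RAMADI', 'TRADING') === jaroWinkler('TRADING', 'RAMADI'));
```

Because each character is counted once, the scores from this sketch will not necessarily match the two values shown in the question, or those of implementations that count repeated characters separately.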
This is in line with the C# version of the method.