I'm using the following class to calculate the Jaro-Winkler distance between two strings. What I'm noticing is that the distance calculated for string A compared to string B is not always the same as for string B compared to string A. Is this to be expected?
RAMADI ~ TRADING
0.73492063492063
TRADING ~ RAMADI
0.71825396825397
It turns out there is a bug in the PHP versions of the Jaro-Winkler string comparison method that can be found in many places online.
Currently, comparing string A to string B yields a different result from comparing string B to string A whenever a character common to both strings occurs more than once in one of them. This is incorrect: the Jaro-Winkler method should yield the same result for A compared to B as for B compared to A.
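The two scores from the question can be reproduced by hand, which makes the effect of a repeated character visible. The sketch below rests on assumptions that are not shown in the post: that the class scores a pair as (m1/len1 + m2/len2 + (m1 - t)/m1) / 3, where m1 and m2 are the lengths of the two common-character strings and t is half the number of mismatched positions between them, and that the repeated A in RAMADI is counted twice against the single A in TRADING (common strings RAADI and RADI, so t = 1). The helper name jaroFromCounts is hypothetical. Since the two words share no prefix, the Winkler adjustment would not change either value.

```php
<?php
// Hypothetical helper: Jaro-style score from precomputed counts.
// The formula is an assumption about the class, not taken from the post.
function jaroFromCounts(int $len1, int $len2, int $m1, int $m2, float $t): float
{
    return ($m1 / $len1 + $m2 / $len2 + ($m1 - $t) / $m1) / 3.0;
}

// RAMADI ~ TRADING: 5 common characters on the left side, 4 on the right.
$ab = jaroFromCounts(6, 7, 5, 4, 1.0);
// TRADING ~ RAMADI: the same counts with the roles swapped.
$ba = jaroFromCounts(7, 6, 4, 5, 1.0);

printf("%.14f\n", $ab); // 0.73492063492063
printf("%.14f\n", $ba); // 0.71825396825397
```

Under these assumptions the only term that differs between the two directions is the last one, (m1 - t)/m1, because m1 is 5 in one direction and 4 in the other. Once the duplicate is removed, both common strings have the same length and that term no longer depends on the order.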
To rectify this, the same character should not be repeated when the common characters are identified. The common characters variable needs to be deduplicated before it is returned.
The below code replaces the common characters string with an array that uses the common character as the key, to avoid duplication. By using the code below, A compared to B yields the same results as B compared to A.
This is in line with the C# version of the method.
//$commonCharacters='';
# The Common Characters variable must be an array
$commonCharacters = [];
for ($i = 0; $i < $str1_len; $i++) {
    $noMatch = true;
    // check whether the char matches within the given allowedDistance
    // and, if it does, add it to commonCharacters
    for ($j = max(0, $i - $allowedDistance); $noMatch && $j < min($i + $allowedDistance + 1, $str2_len); $j++) {
        if ($temp_string2[(int)$j] == $string1[$i]) { // MJR
            $noMatch = false;
            //$commonCharacters .= $string1[$i];
            # The Common Characters array uses the character as a key to avoid duplication.
            $commonCharacters[$string1[$i]] = $string1[$i];
            # Mark the matched position as used. Assigning '' to a string offset
            # throws an Error in PHP 8, so a NUL byte is used as the placeholder.
            $temp_string2[(int)$j] = "\0"; // MJR
        }
    }
}
//return $commonCharacters;
# When returning, turn the array back to a string, as expected
return implode("", $commonCharacters);
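For anyone who wants to try the change on its own, here is a minimal sketch that wraps the loop above in a complete function and runs it in both directions. The function name, its signature, and the allowed distance of 3 are assumptions made for illustration; the surrounding class may name or compute these differently. A NUL byte is used to mark consumed positions, so inputs that themselves contain NUL bytes are not handled.

```php
<?php
// Sketch only: name, signature and the distance passed below are assumptions.
function getCommonCharacters(string $string1, string $string2, int $allowedDistance): string
{
    $str1_len = strlen($string1);
    $str2_len = strlen($string2);
    $temp_string2 = $string2;   // working copy; matched positions get blanked
    $commonCharacters = [];     // keyed by character, so repeats collapse

    for ($i = 0; $i < $str1_len; $i++) {
        $noMatch = true;
        $start = max(0, $i - $allowedDistance);
        $end = min($i + $allowedDistance + 1, $str2_len);
        for ($j = $start; $noMatch && $j < $end; $j++) {
            if ($temp_string2[$j] === $string1[$i]) {
                $noMatch = false;
                $commonCharacters[$string1[$i]] = $string1[$i];
                $temp_string2[$j] = "\0"; // placeholder for a used position
            }
        }
    }

    return implode('', $commonCharacters);
}

$ab = getCommonCharacters('RAMADI', 'TRADING', 3);
$ba = getCommonCharacters('TRADING', 'RAMADI', 3);

echo $ab, "\n"; // RADI
echo $ba, "\n"; // RADI
var_dump($ab === $ba); // bool(true)
```

With the repeated A collapsed, both directions return a common string of the same length for this pair, which is what the rest of the score calculation needs in order to give the same value either way round.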