How to calculate Jaro Winkler distance matrix of strings in Python?
I have a large array of hand-entered strings (names and record numbers), and I'm trying to find duplicates in the list, including duplicates with slight variations in spelling. A response to a similar question suggested using SciPy's pdist function with a custom distance function. I've tried to implement this with the jaro_winkler function from the Levenshtein package. The problem is that jaro_winkler requires string inputs, whereas pdist seems to require a 2D array input.
Example:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from Levenshtein import jaro_winkler
fname = np.array(['Bob','Carl','Kristen','Calr', 'Doug']).reshape(-1,1)
dm = pdist(fname, jaro_winkler)
dm = squareform(dm)
Expected Output - Something like this:
Bob Carl Kristen Calr Doug
Bob 1.0 - - - -
Carl 0.0 1.0 - - -
Kristen 0.0 0.46 1.0 - -
Calr 0.0 0.93 0.46 1.0 -
Doug 0.53 0.0 0.0 0.0 1.0
Actual Error:
jaro_winkler expected two Strings or two Unicodes
I'm assuming this is because the jaro_winkler function is seeing an ndarray instead of a string, and I'm not sure how to convert the function input to a string in the context of the pdist function.
Does anyone have a suggestion to allow this to work? Thanks in advance!
You need to wrap the distance function so that it unpacks the string from each one-element row that pdist passes in. Here is an example with the Levenshtein distance:
import numpy as np
from Levenshtein import distance
from scipy.spatial.distance import pdist, squareform
# my list of strings
strings = ["hello","hallo","choco"]
# prepare 2 dimensional array M x N (M entries (3) with N dimensions (1))
transformed_strings = np.array(strings).reshape(-1,1)
# calculate condensed distance matrix by wrapping the Levenshtein distance function
distance_matrix = pdist(transformed_strings,lambda x,y: distance(x[0],y[0]))
# get square matrix
print(squareform(distance_matrix))
Output:
array([[ 0., 1., 4.],
[ 1., 0., 4.],
[ 4., 4., 0.]])
For anyone with a similar problem: one solution I just found is to extract the relevant loop from the pdist function and add a [0] to each jaro_winkler argument, which pulls the string out of its one-element numpy row.
Example:
X = np.asarray(fname, order='c')
s = X.shape
m, n = s
dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
k = 0
for i in range(0, m - 1):  # range replaces Python 2's xrange
    for j in range(i + 1, m):
        # X[i] is a one-element row, so [0] extracts the string itself
        dm[k] = jaro_winkler(X[i][0], X[j][0])
        k = k + 1
dms = squareform(dm)
Even though this algorithm works I'd still like to learn if there's a "right" computer-sciency-way to do this with the pdist function. Thanks, and hope this helps someone!
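The double loop above visits each unordered pair exactly once, and itertools.combinations can express the same traversal without copying code out of pdist. Here is a rough sketch under a few assumptions: the similarity function is symmetric, and `stand_in_similarity` (built on difflib from the standard library) is only a placeholder so the snippet runs without third-party packages. Swap in jaro_winkler to get Jaro-Winkler scores. The helper name `similarity_matrix` is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_similarity(a, b):
    # Placeholder metric, NOT Jaro-Winkler; replace with
    # Levenshtein.jaro_winkler to score strings as in the question.
    return SequenceMatcher(None, a, b).ratio()


def similarity_matrix(items, similarity):
    n = len(items)
    # start with 1.0 on the diagonal (each string matches itself)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # score each unordered pair once and mirror it across the diagonal
    for i, j in combinations(range(n), 2):
        score = similarity(items[i], items[j])
        matrix[i][j] = score
        matrix[j][i] = score
    return matrix


names = ['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']
sm = similarity_matrix(names, stand_in_similarity)
for name, row in zip(names, sm):
    print(f'{name:8s}' + ''.join(f'{value:6.2f}' for value in row))
```

Scoring only the pairs with i < j roughly halves the number of metric calls compared with scoring every ordered pair.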
Here's a concise solution that requires neither numpy nor scipy:
from Levenshtein import jaro_winkler
data = ['Bob','Carl','Kristen','Calr', 'Doug']
dm = [[ jaro_winkler(a, b) for b in data] for a in data]
print('\n'.join([''.join([f'{item:6.2f}' for item in row]) for row in dm]))
Output:
1.00 0.00 0.00 0.00 0.53
0.00 1.00 0.46 0.93 0.00
0.00 0.46 1.00 0.46 0.00
0.00 0.93 0.46 1.00 0.00
0.53 0.00 0.00 0.00 1.00
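Since the original goal was to find near-duplicates, the matrix can then be scanned for pairs above a cutoff. This is a small sketch only: the scores are copied from the printed matrix above (rounded to two decimals), the 0.9 threshold is an arbitrary value I picked for illustration, and `likely_duplicates` is a hypothetical helper name.

```python
names = ['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']

# similarity scores copied from the printed matrix (two-decimal rounding)
dm = [
    [1.00, 0.00, 0.00, 0.00, 0.53],
    [0.00, 1.00, 0.46, 0.93, 0.00],
    [0.00, 0.46, 1.00, 0.46, 0.00],
    [0.00, 0.93, 0.46, 1.00, 0.00],
    [0.53, 0.00, 0.00, 0.00, 1.00],
]

THRESHOLD = 0.9  # assumed cutoff; tune it for your own data


def likely_duplicates(names, matrix, threshold):
    pairs = []
    # look only above the diagonal so each pair is reported once
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                pairs.append((names[i], names[j], matrix[i][j]))
    return pairs


print(likely_duplicates(names, dm, THRESHOLD))
# prints [('Carl', 'Calr', 0.93)]
```

A lower threshold catches more spelling variants but also more false matches, so it is worth checking a sample of the flagged pairs by hand.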