python numpy pairwise edit-distance

Posted 2019-05-02 10:12

I have a NumPy array of strings, and I want to calculate the edit distance between every pair of elements using scipy.spatial.distance.pdist (http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html).

A sample of my array is as follows:

 >>> d[0:10]
 array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
   'GATTT', 'TCTTT', 'ACTTT'], 
  dtype='|S5')

However, pdist has no built-in 'editdistance' metric, so I want to pass it a custom distance function. I tried the following and got this error:

 >>> import editdist
 >>> import scipy
 >>> import scipy.spatial
 >>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
    X = np.double(X)
ValueError: could not convert string to float: TTTTT

2 Answers
劫难
#2 · 2019-05-02 10:30

If you really must use pdist, you first need to convert your strings to numeric format. If you know that all strings will be the same length, you can do this rather easily:

numeric_d = d.view(np.uint8).reshape((len(d),-1))

This simply views your array of strings as a long array of uint8 bytes, then reshapes it such that each original string is on a row by itself. In your example, this would look like:

In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
       [65, 84, 84, 84, 84],
       [67, 84, 84, 84, 84],
       [71, 84, 84, 84, 84],
       [84, 65, 84, 84, 84],
       [65, 65, 84, 84, 84],
       [67, 65, 84, 84, 84],
       [71, 65, 84, 84, 84],
       [84, 67, 84, 84, 84],
       [65, 67, 84, 84, 84]], dtype=uint8)

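Then you can use pdist as you normally would. Just make sure that your editdist function expects arrays of numbers rather than strings. Note that pdist may hand the rows to your function as floats (the traceback in the question shows it converting its input to double), so cast each row back to uint8 before turning it into a byte string with .tobytes() (.tostring() is the older, deprecated name for the same method):

def editdist(x, y):
  s1 = np.asarray(x, dtype=np.uint8).tobytes()
  s2 = np.asarray(y, dtype=np.uint8).tobytes()
  ... rest of function as before ...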
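
Putting the steps above together, here is a minimal end-to-end sketch. The `levenshtein` helper is a hypothetical plain-Python stand-in (it is not the `editdist` module from the question), and `row_metric` is just an illustrative name. The cast back to uint8 is there because, depending on the SciPy version, the rows may reach the callable as float64.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def levenshtein(a, b):
    """Plain-Python edit distance between two sequences (hypothetical helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def row_metric(u, v):
    # Rows may arrive as float64, so cast back to uint8 before decoding.
    s1 = np.asarray(u, dtype=np.uint8).tobytes()
    s2 = np.asarray(v, dtype=np.uint8).tobytes()
    return levenshtein(s1, s2)


# Fixed-width byte strings, as in the question (dtype '|S5').
d = np.array([b'TTTTT', b'ATTTT', b'CTTTT', b'TATTT', b'AATTT'])
numeric_d = d.view(np.uint8).reshape((len(d), -1))

condensed = pdist(numeric_d, row_metric)   # one entry per pair, N*(N-1)/2 in total
square = squareform(condensed)             # full symmetric N x N matrix
```

This assumes byte strings. On Python 3, an array built from str objects has a unicode dtype with four bytes per character, so ASCII data would need something like d.astype('S5') before the uint8 view.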
我命由我不由天
#3 · 2019-05-02 10:32

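import numpy as np

def my_pdist(data, f):
    # Pure-Python replacement for pdist: f is applied to the raw elements,
    # so data can hold strings directly.
    N = len(data)
    # Condensed result: one entry per pair (i, j) with i < j.
    # Use integer division; N*(N-1)/2 is a float on Python 3.
    matrix = np.empty(N * (N - 1) // 2)
    ind = 0
    for i in range(N):
        for j in range(i + 1, N):
            matrix[ind] = f(data[i], data[j])
            ind += 1
    return matrix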
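
A short usage sketch for the function above. Because it never goes through pdist, it can be called on the strings themselves. The function is restated compactly with itertools.combinations so the block runs on its own, and `mismatch_count` is a hypothetical stand-in metric, not a real edit distance; any two-argument function, such as editdist.distance from the question, could be passed instead.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform


def my_pdist(data, f):
    # combinations() yields (i, j) pairs with i < j, in the same order
    # as the nested loops in the answer.
    pairs = combinations(range(len(data)), 2)
    return np.array([f(data[i], data[j]) for i, j in pairs], dtype=float)


def mismatch_count(a, b):
    # Hypothetical stand-in metric: number of differing positions.
    # Assumes equal-length strings; swap in a real edit distance here.
    return sum(x != y for x, y in zip(a, b))


d = np.array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT'])
condensed = my_pdist(d, mismatch_count)   # same layout as pdist output
square = squareform(condensed)            # expand to a symmetric N x N matrix
```

The condensed vector uses the same pair ordering as pdist, which is why squareform can expand it into the full matrix.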
