Calculate the euclidean distance in scipy csr matr

2019-02-18 03:16发布

问题:

I need to calculate the Euclidean Distance between all points that is stored in csr sparse matrix and some lists of points. It would be easier for me to convert the csr to a dense one, but I couldn't due to the lack of memory, so I need to keep it as csr.

So for example I have this data_csr sparse matrix (view in both, csr and dense):

data_csr
(0, 2)  4
(1, 0)  1
(1, 4)  2
(2, 0)  2
(2, 3)  1
(3, 5)  1
(4, 0)  4
(4, 2)  3
(4, 3)  2

data_csr.todense()
[[0, 0, 4, 0, 0, 0]
 [1, 0, 0, 0, 2, 0]
 [2, 0, 0, 1, 0, 0]
 [0, 0, 0, 0, 0, 1]
 [4, 0, 3, 2, 0, 0]]

and this center lists of points:

center
array([[0, 1, 2, 2, 4, 1],
      [3, 4, 1, 2, 4, 0]])

using the scipy.spatial package, the Euclidean Distance array between data_csr and center will be like the one below. So each point, of total 6 points, in each row of center was calculated against all rows in data_csr. The first row of the result array(2,5) is the ED between the first row of center and all rows in data_csr.

scipy.spatial.distance.cdist(center, data_csr, 'euclidean')

array([[ 5.09901951,  3.87298335,  5.19615242,  5.        ,  5.91607978],
      [ 7.34846923,  5.38516481,  5.91607978,  6.8556546 ,  6.08276253]])


What I've learned so far that I can get the nonzero values as well the indices with:

data_csr.data
array([4, 1, 2, 2, 1, 1, 4, 3, 2])

data_csr.indices
array([2, 0, 4, 0, 3, 5, 0, 2, 3])

But still I can't figure out how to calculate the ED between these two objects.

回答1:

So let's create your matrix (too bad you didn't give inputs that I could copy-n-paste)

In [114]: data=[4,1,2,2,1,1,4,3,2]   
In [115]: col=[0,1,1,2,2,3,4,4,4]
In [116]: row=[2,0,4,0,3,5,0,2,3]
In [117]: M=sparse.csr_matrix((data,(col,row)))

In [118]: M
Out[118]: 
<5x6 sparse matrix of type '<type 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>

In [119]: M.A
Out[119]: 
array([[0, 0, 4, 0, 0, 0],
       [1, 0, 0, 0, 2, 0],
       [2, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [4, 0, 3, 2, 0, 0]])

In [121]: center=np.array([[0,1,2,2,4,1],[3,4,1,2,4,0]])

So how did you calculate the distance? M.A is (5,6), center is (2,6). It's not obvious what you are doing with these two arrays.

As for access to the 'raw' sparse values, the coo format is easiest to understand. It's the same row,col,data stuff I used to create the matrix

In [131]: M.tocoo().data
Out[131]: array([4, 1, 2, 2, 1, 1, 4, 3, 2])

In [132]: M.tocoo().col
Out[132]: array([2, 0, 4, 0, 3, 5, 0, 2, 3])

In [133]: M.tocoo().row
Out[133]: array([0, 1, 1, 2, 2, 3, 4, 4, 4])

The csr stores the same info in data, indices and indptr arrays. But you have to do some math to pull to the i,j values from the last 2. csr multiplication routines make good use of these arrays.

In general it is better to do multiplication with csr matrices than addition/subtraction.

I await further clarification.


spatial.distance.cdist(center,M.A, 'euclidean')
Out[156]: 
array([[ 5.09901951,  3.87298335,  5.19615242,  5.        ,  5.91607978],
       [ 7.34846923,  5.38516481,  5.91607978,  6.8556546 ,  6.08276253]])

What we need to do is study this function, and understand its inputs. We may have to go beyond its docs and look at the code.

But looking at this code I see steps to ensure that xB is 2d array, with the same number of columns as xA. Then for euclidian it calls

_distance_wrap.cdist_euclidean_wrap(_convert_to_double(XA),
                                    _convert_to_double(XB), dm)

which looks like a wrapper on some C code. I can't imagine any way of feeding it a sparse matrix.

You could iterate over rows; calling dist with M[[0],:].A is the same as M.A[[0],:] - except for speed. Iterating over rows of a sparse matrix is kind of slow, as much because it has to construct a new sparse matrix at each iteration. csr and lil are the 2 fastest for row iteration.

Here's something that might be faster - directly iterating on the attributes of the lil format:

 def foo(a,b,n):
    # make a dense array from data,row
    res = np.zeros((1,n))
    res[0,b]=a
    return res

In [190]: Ml=M.tolil()

In [191]: Ml.data
Out[191]: array([[4], [1, 2], [2, 1], [1], [4, 3, 2]], dtype=object)

In [192]: Ml.rows
Out[192]: array([[2], [0, 4], [0, 3], [5], [0, 2, 3]], dtype=object)

In [193]: rowgen=(foo(a,b,6) for a,b in zip(Ml.data,Ml.rows))

In [194]: np.concatenate([spatial.distance.cdist(center,row, 'euclidean') for row in rowgen],axis=1)
Out[194]: 
array([[ 5.09901951,  3.87298335,  5.19615242,  5.        ,  5.91607978],
       [ 7.34846923,  5.38516481,  5.91607978,  6.8556546 ,  6.08276253]])

For now I'll skip the time tests.



回答2:

Pairwise euclidian distance on sparse matrices is implemented in sklearn (as pointed out by hpaulj, the scipy implementation does not work on sparse matrices).

Example on hpaulj example:

import scipy.sparse
import sklearn.metrics.pairwise
data = [4,1,2,2,1,1,4,3,2]  
col = [0,1,1,2,2,3,4,4,4]
row = [2,0,4,0,3,5,0,2,3]
M = scipy.sparse.csr_matrix((data,(col,row)))
distances = sklearn.metrics.pairwise.pairwise_distances(M,M)

Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html