Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.
Say the input matrix is:
A=
[0 1 0 0 1
0 0 1 1 1
1 1 0 1 0]
The sparse representation is:
A =
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3
In Python, it's straightforward to work with the matrix-input format:
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])
# cosine similarity = 1 - cosine distance
dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out
Gives:
array([[ 1. , 0.40824829, 0.40824829],
[ 0.40824829, 1. , 0.33333333],
[ 0.40824829, 0.33333333, 1. ]])
That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.
I took all these answers and wrote a script to (1) validate each of the results (see assertion below) and (2) see which is fastest. Code and results are below:
Results:
I suggest doing this in two steps:
1) Build a mapping M from each column index to the set of objects (rows) that have a non-zero entry in that column.
2) For each object i (row) with non-zero entries in columns {k1, ..., kn}, calculate the cosine similarity only against the objects in the union M[k1] U M[k2] U ... U M[kn].
Assuming a large matrix with high sparsity, this should give a significant speedup over brute force, because rows that share no non-zero column (and so have similarity 0) are never compared.
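The two steps above can be sketched in plain Python as follows. This is only a minimal illustration, assuming binary (0/1) entries as in the question's example; the function name `sparse_cosine_rows` and the variable names are made up for this sketch, and pairs that share no column are simply left out of the result (their similarity is 0).

```python
from collections import defaultdict
from math import sqrt


def sparse_cosine_rows(pairs):
    """Cosine similarity between rows of a binary sparse matrix.

    pairs: iterable of (row, col) positions of the non-zero entries.
    Returns {(i, j): similarity} for i < j, only for rows sharing a column.
    """
    row_cols = defaultdict(set)  # row index -> set of non-zero columns
    col_rows = defaultdict(set)  # step 1: column index -> set of rows
    for r, c in pairs:
        row_cols[r].add(c)
        col_rows[c].add(r)

    sims = {}
    for i, cols_i in row_cols.items():
        # Step 2: only rows that share at least one column with row i
        candidates = set().union(*(col_rows[k] for k in cols_i))
        for j in candidates:
            if j <= i:
                continue  # skip self and pairs already handled
            shared = len(cols_i & row_cols[j])  # dot product for 0/1 data
            sims[(i, j)] = shared / sqrt(len(cols_i) * len(row_cols[j]))
    return sims


# Sparse listing from the question
pairs = [(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 3)]
sims = sparse_cosine_rows(pairs)
```

For column-wise similarity the same function can be called with each pair swapped to (col, row). For weighted (non-binary) data the dot product and norms would have to use the stored values instead of set sizes.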
Hi, you can do it this way:
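A minimal sketch of one way to start directly from the sparse listing, assuming SciPy and scikit-learn are available (this is an illustration, not necessarily the exact code this answer had in mind). It builds a `scipy.sparse` matrix from the (row, column) pairs and passes it to scikit-learn's `cosine_similarity`, which accepts sparse input; the variable names are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.metrics.pairwise import cosine_similarity

# (row, col) positions of the non-zero entries from the question
rows = np.array([0, 0, 1, 1, 1, 2, 2, 2])
cols = np.array([1, 4, 2, 3, 4, 0, 1, 3])
data = np.ones(len(rows))  # assumption: every listed entry has value 1

# Build the 3 x 5 matrix without ever creating the dense array
A_sparse = coo_matrix((data, (rows, cols)), shape=(3, 5)).tocsr()

# dense_output=False keeps the result sparse as well
sim_rows = cosine_similarity(A_sparse, dense_output=False)    # row vs row
sim_cols = cosine_similarity(A_sparse.T, dense_output=False)  # column vs column

print(sim_rows.toarray())
```

For the example matrix, the row-wise result should agree with the dense output shown in the question.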