Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.
Say the input matrix is:
A=
[0 1 0 0 1
0 0 1 1 1
1 1 0 1 0]
The sparse representation is:
A =
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3
In Python, it's straightforward to work with the matrix-input format:
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])
dist_out = 1-pairwise_distances(A, metric="cosine")
dist_out
Gives:
array([[ 1. , 0.40824829, 0.40824829],
[ 0.40824829, 1. , 0.33333333],
[ 0.40824829, 0.33333333, 1. ]])
That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.
You should check out
scipy.sparse
(link). You can apply operations on those sparse matrices just like how you use a normal matrix.You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:
Results:
If you want column-wise cosine similarities simply transpose your input matrix beforehand:
The following method is about 30 times faster than
scipy.spatial.distance.pdist
. It works pretty quickly on large matrices (assuming you have enough RAM)See below for a discussion of how to optimize for sparsity.
If your problem is typical for large scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.
If this is the case, list your 'items' in rows and create
A
usingscipy.sparse
. Then replace the first line as indicated.If your problem is atypical you'll need more modifications. Those should be pretty straightforward replacements of basic
numpy
operations with theirscipy.sparse
equivalents.I have tried some methods above. However, the experiment by @zbinsd has its limitation. The sparsity of matrix used in the experiment is extremely low while the real sparsity is usually over 90%. In my condition, the sparse is with the shape of (7000, 25000) and the sparsity of 97%. The method 4 is extremely slow and I can't tolerant getting the results. I use the method 6 which is finished in 10 s. Amazingly, I try the method below and it's finished in only 0.247 s.
This efficient method is linked by enter link description here
This method seems to be somewhat faster than using sklearn's implementation if you pass in one pair of vectors at a time.
Building off of Vaali's solution:
This takes a sparse matrix (preferably a csr_matrix) and returns a csr_matrix. It should do the more intensive parts using sparse calculations with pretty minimal memory overhead.
I haven't tested it extensively though, so caveat emptor(Update: I feel confident in this solution now that I've tested and benchmarked it)Also, here is the sparse version of Waylon's solution in case it helps anyone, not sure which solution is actually better.
Both solutions seem to have parity with sklearn.metrics.pairwise.cosine_similarity
:-D
Update:
Now I have tested both solutions against my existing Cython implementation: https://github.com/davidmashburn/sparse_dot/blob/master/test/benchmarks_v3_output_table.txt and it looks like the first algorithm performs the best of the three most of the time.