I need to calculate the cosine similarity between two lists, for example list 1, dataSetI, and list 2, dataSetII. I cannot use anything such as numpy or a statistics module. I must use common modules (math, etc.), and as few modules as possible, to reduce the time spent.
Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The lengths of the lists are always equal.
Of course, the cosine similarity is between 0 and 1 here (the values are all non-negative; in general it ranges from -1 to 1), and for the sake of it, it will be rounded to the third or fourth decimal with format(round(cosine, 3)).
Thank you very much in advance for helping.
I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order, at least under Python 2, where it builds a full list. It would be interesting to time the nuts-and-bolts implementation:
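A minimal sketch of such a loop-based version, assuming plain Python lists of numbers of equal length. The function name cosine_similarity is illustrative, and zero vectors are not handled:

```python
import math

def cosine_similarity(v1, v2):
    """Compute (v1 . v2) / (||v1|| * ||v2||) in a single pass."""
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]
        y = v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    # One square root covers both norms: sqrt(a) * sqrt(b) == sqrt(a * b).
    return sumxy / math.sqrt(sumxx * sumyy)

v1, v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1, v2))  # similarity is approximately 0.9723
```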
That goes through the C-like noise of extracting elements one-at-a-time, but does no bulk array copying and gets everything important done in a single for loop, and uses a single square root.
ETA: Updated print call to be a function. (The original was Python 2.7, not 3.3. The current runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same, either way.
CPython 2.7.3 on 3.0GHz Core 2 Duo:
So, the unpythonic way is about 3.6 times faster in this case.
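For anyone who wants to repeat the comparison on their own machine, here is a rough sketch using the standard timeit module. The zip-based function is an assumed stand-in for the "Pythonic" version, not necessarily the code that was timed above, and the repeat count is arbitrary. Results will vary with the interpreter version and hardware.

```python
import math
import timeit

def cosine_zip(v1, v2):
    """zip-based version: pairs elements up before summing."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    return dot / (norm1 * norm2)

def cosine_loop(v1, v2):
    """Index-based version: one pass, one square root."""
    sumxx = sumxy = sumyy = 0
    for i in range(len(v1)):
        x, y = v1[i], v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

v1, v2 = [3, 45, 7, 2], [2, 54, 13, 15]
runs = 10000  # arbitrary repeat count, chosen only for illustration
t_zip = timeit.timeit(lambda: cosine_zip(v1, v2), number=runs)
t_loop = timeit.timeit(lambda: cosine_loop(v1, v2), number=runs)
print("zip:  %.4f s" % t_zip)
print("loop: %.4f s" % t_loop)
```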
You can use the cosine_similarity function from sklearn.metrics.pairwise (see the docs).
If you happen to be using PyTorch already, you should go with their CosineSimilarity implementation.
Suppose you have two n-dimensional numpy.ndarrays, v1 and v2, i.e. their shapes are both (n,). Here's how you get their cosine similarity:
Or suppose you have two numpy.ndarrays w1 and w2, whose shapes are both (m, n). The following gets you a list of cosine similarities, each being the cosine similarity between a row in w1 and the corresponding row in w2:
You can do this in Python using a simple function:
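One way such a function might look, as a sketch that uses only the math module. The name cosine and the choice to raise ValueError for mismatched lengths or zero vectors are assumptions made for this example:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length lists of numbers."""
    if len(a) != len(b):
        raise ValueError("lists must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(format(round(cosine(dataSetI, dataSetII), 3)))  # prints 0.972
```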
Using numpy, compare one list of numbers to multiple lists (a matrix):
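A possible sketch of that comparison, assuming numpy is installed and that no row is all zeros. The helper name cosine_one_to_many and the sample rows are made up for illustration:

```python
import numpy as np

def cosine_one_to_many(vector, matrix):
    """Cosine similarity between `vector` and every row of `matrix`."""
    vector = np.asarray(vector, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix.dot(vector)  # one dot product per row, shape (m,)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return dots / norms

query = [3, 45, 7, 2]
# Hypothetical candidate rows; the first is dataSetII from the question.
candidates = [[2, 54, 13, 15], [3, 45, 7, 2], [6, 90, 14, 4]]
print(np.round(cosine_one_to_many(query, candidates), 3))
```

Each entry of the result corresponds to one row of the matrix, so a row identical to the query (or a positive multiple of it) scores 1.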