I need to calculate the cosine similarity between two lists, say list 1, dataSetI, and list 2, dataSetII. I cannot use anything such as numpy or a statistics module. I must use common modules (math, etc.), and as few of them as possible, to reduce the time spent.
Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The lengths of the lists are always equal.
Of course, the cosine similarity is between 0 and 1 (for non-negative data such as this; in general it ranges from -1 to 1), and it will be rounded to the third or fourth decimal place with format(round(cosine, 3)).
Thank you very much in advance for helping.
Without using any imports, math.sqrt(x) can be replaced with x ** 0.5.
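A minimal import-free sketch, assuming the square root is taken with `** 0.5` rather than `math.sqrt`. The function name `cosine_similarity` and its variable names are illustrative, not taken from the original answer:

```python
def cosine_similarity(v1, v2):
    """Cosine similarity of two equal-length numeric lists, with no imports."""
    sum_xx = sum_yy = sum_xy = 0
    for x, y in zip(v1, v2):
        sum_xx += x * x  # running squared length of v1
        sum_yy += y * y  # running squared length of v2
        sum_xy += x * y  # running dot product
    # ** 0.5 stands in for math.sqrt; an all-zero list would raise ZeroDivisionError
    return sum_xy / (sum_xx * sum_yy) ** 0.5


dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(round(cosine_similarity(dataSetI, dataSetII), 3))  # 0.972
```

A single pass over both lists keeps the work to one loop, which fits the question's goal of keeping overhead low.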
Without using numpy.dot(), you have to create your own dot function using a list comprehension:
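One way such a helper might look. The name `dot` is an assumption for illustration:

```python
def dot(a, b):
    """Dot product of two equal-length lists, built with a list comprehension."""
    return sum([x * y for x, y in zip(a, b)])


print(dot([3, 45, 7, 2], [2, 54, 13, 15]))  # 2557
```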
Then it's just a simple matter of applying the cosine similarity formula:
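As a sketch, the formula cos(theta) = (a . b) / (|a| * |b|) can be written entirely in terms of a dot product, since |v| is the square root of v . v. The helper names `dot` and `cosine_similarity` are assumptions here:

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum([x * y for x, y in zip(a, b)])


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|), where |v| = (v . v) ** 0.5."""
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))


print(round(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]), 4))  # 0.9723
```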
You can use this simple function to calculate the cosine similarity:
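A minimal sketch of such a function using only the standard `math` module. The input checks are an addition of this sketch, not something the answer is known to have included:

```python
import math


def cosine_similarity(v1, v2):
    """Cosine similarity of two equal-length numeric lists using math.sqrt."""
    if len(v1) != len(v2):
        raise ValueError("lists must have the same length")
    dot_product = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    if norm1 == 0 or norm2 == 0:
        # the angle to a zero vector is undefined
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product / (norm1 * norm2)


print(round(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]), 3))  # 0.972
```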
I ran a benchmark of several answers to this question, and the following snippet appears to be the best choice:
I was surprised to find that the implementation based on scipy is not the fastest one. I profiled it and found that cosine in scipy spends a lot of time casting a vector from a Python list to a numpy array.
Another version based on numpy only:
You can round it after computing:
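For instance, with the question's example lists (the variable names `rounded` and `text` are illustrative):

```python
a = [3, 45, 7, 2]
b = [2, 54, 13, 15]

# Full-precision cosine similarity, computed without imports
cosine = sum(x * y for x, y in zip(a, b)) / (
    sum(x * x for x in a) * sum(y * y for y in b)
) ** 0.5

rounded = round(cosine, 3)    # a float: 0.972
text = format(cosine, ".3f")  # a string: '0.972'
print(rounded, text)
```

Rounding only at the end keeps intermediate sums at full precision. `round` returns a number, while `format` returns a string suitable for display.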
If you want it really short, you can use this one-liner:
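A hedged sketch of what such a one-liner could look like (the function body is a single expression; the name `cosine_similarity` is an assumption):

```python
def cosine_similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5


print(round(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]), 3))  # 0.972
```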
You should try SciPy. It has a bunch of useful scientific routines, for example, "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the superfast optimized NumPy for its number crunching. See the SciPy installation instructions for how to install it.
Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.
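To make the distance/similarity relationship concrete without requiring SciPy, here is a pure-Python sketch. `cosine_distance` is a hypothetical stand-in that mirrors the quantity `spatial.distance.cosine` returns, namely 1 minus the cosine of the angle between the vectors:

```python
import math


def cosine_distance(u, v):
    """Hypothetical stand-in for the cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot_product = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1 - dot_product / (norm_u * norm_v)


dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]

distance = cosine_distance(dataSetI, dataSetII)
similarity = 1 - distance  # subtract the distance from 1 to recover the similarity
print(round(similarity, 3))  # 0.972
```

With SciPy installed, the same conversion would be `1 - spatial.distance.cosine(dataSetI, dataSetII)`.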