Cosine Similarity between 2 Number Lists

2020-01-24 06:17发布

I need to calculate the cosine similarity between two lists, let's say for example list 1 which is dataSetI and list 2 which is dataSetII. I cannot use anything such as numpy or a statistics module. I must use common modules (math, etc) (and the least modules as possible, at that, to reduce time spent).

Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The length of the lists are always equal.

Of course, the cosine similarity is between 0 and 1, and for the sake of it, it will be rounded to the third or fourth decimal with format(round(cosine, 3)).

Thank you very much in advance for helping.

11条回答
干净又极端
2楼-- · 2020-01-24 06:49

without using any imports

math.sqrt(x)

can be replaced with

x** .5

without using numpy.dot() you have to create your own dot function using list comprehension:

def dot(A,B): 
    return (sum(a*b for a,b in zip(A,B)))

and then its just a simple matter of applying the cosine similarity formula:

def cosine_similarity(a,b):
    return dot(a,b) / ( (dot(a,a) **.5) * (dot(b,b) ** .5) )
查看更多
萌系小妹纸
3楼-- · 2020-01-24 06:50

You can use this simple function to calculate the cosine similarity:

def cosine_similarity(a, b):
return sum([i*j for i,j in zip(a, b)])/(math.sqrt(sum([i*i for i in a]))* math.sqrt(sum([i*i for i in b])))
查看更多
Animai°情兽
4楼-- · 2020-01-24 06:51

I did a benchmark based on several answers in the question and the following snippet is believed to be the best choice:

def dot_product2(v1, v2):
    return sum(map(operator.mul, v1, v2))


def vector_cos5(v1, v2):
    prod = dot_product2(v1, v2)
    len1 = math.sqrt(dot_product2(v1, v1))
    len2 = math.sqrt(dot_product2(v2, v2))
    return prod / (len1 * len2)

The result makes me surprised that the implementation based on scipy is not the fastest one. I profiled and find that cosine in scipy takes a lot of time to cast a vector from python list to numpy array.

enter image description here

查看更多
Luminary・发光体
5楼-- · 2020-01-24 06:53

another version based on numpy only

from numpy import dot
from numpy.linalg import norm

cos_sim = dot(a, b)/(norm(a)*norm(b))
查看更多
迷人小祖宗
6楼-- · 2020-01-24 06:54
import math
from itertools import izip

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], izip(v1, v2)))

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return prod / (len1 * len2)

You can round it after computing:

cosine = format(round(cosine_measure(v1, v2), 3))

If you want it really short, you can use this one-liner:

from math import sqrt
from itertools import izip

def cosine_measure(v1, v2):
    return (lambda (x, y, z): x / sqrt(y * z))(reduce(lambda x, y: (x[0] + y[0] * y[1], x[1] + y[0]**2, x[2] + y[1]**2), izip(v1, v2), (0, 0, 0)))
查看更多
三岁会撩人
7楼-- · 2020-01-24 06:56

You should try SciPy. It has a bunch of useful scientific routines for example, "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the superfast optimized NumPy for its number crunching. See here for installing.

Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.

from scipy import spatial

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
查看更多
登录 后发表回答