Cosine Similarity between 2 Number Lists-第2页回答

I need to calculate the cosine similarity between two lists, let's say for example list 1 which is dataSetI and list 2 which is dataSetII. I cannot use anything such as numpy or a statistics module. I must use common modules (math, etc) (and the least modules as possible, at that, to reduce time spent).

Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The length of the lists are always equal.

Of course, the cosine similarity is between 0 and 1, and for the sake of it, it will be rounded to the third or fourth decimal with format(round(cosine, 3)).

Thank you very much in advance for helping.

标签： python python-3.x cosine-similarity

11条回答

▲ chillily

2楼-- · 2020-01-24 06:58

I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order. It would be interesting to time the nuts-and-bolts implementation:

import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/{||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

v1,v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1,v2))

Output: [3, 45, 7, 2] [2, 54, 13, 15] 0.972284251712

That goes through the C-like noise of extracting elements one-at-a-time, but does no bulk array copying and gets everything important done in a single for loop, and uses a single square root.

ETA: Updated print call to be a function. (The original was Python 2.7, not 3.3. The current runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same, either way.

CPYthon 2.7.3 on 3.0GHz Core 2 Duo:

>>> timeit.timeit("cosine_similarity(v1,v2)",setup="from __main__ import cosine_similarity, v1, v2")
2.4261788514654654
>>> timeit.timeit("cosine_measure(v1,v2)",setup="from __main__ import cosine_measure, v1, v2")
8.794677709375264

So, the unpythonic way is about 3.6 times faster in this case.

0人赞添加讨论(0) 举报

男人必须洒脱

3楼-- · 2020-01-24 07:03

You can use cosine_similarity function form sklearn.metrics.pairwise docs

In [23]: from sklearn.metrics.pairwise import cosine_similarity

In [24]: cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
Out[24]: array([[-0.5]])

0人赞添加讨论(0) 举报

家丑人穷心不美

4楼-- · 2020-01-24 07:06

If you happen to be using PyTorch already, you should go with their CosineSimilarity implementation.

Suppose you have two n-dimensional numpy.ndarrays, v1 and v2, i.e. their shapes are both (n,). Here's how you get their cosine similarity:

import torch
import torch.nn as nn

cos = nn.CosineSimilarity()
cos(torch.tensor([v1]), torch.tensor([v2])).item()

Or suppose you have two numpy.ndarrays w1 and w2, whose shapes are both (m, n). The following gets you a list of cosine similarities, each being the cosine similarity between a row in w1 and the corresponding row in w2:

cos(torch.tensor(w1), torch.tensor(w2)).tolist()

0人赞添加讨论(0) 举报

唯我独甜

5楼-- · 2020-01-24 07:08

You can do this in Python using simple function:

def get_cosine(text1, text2):
  vec1 = text1
  vec2 = text2
  intersection = set(vec1.keys()) & set(vec2.keys())
  numerator = sum([vec1[x] * vec2[x] for x in intersection])
  sum1 = sum([vec1[x]**2 for x in vec1.keys()])
  sum2 = sum([vec2[x]**2 for x in vec2.keys()])
  denominator = math.sqrt(sum1) * math.sqrt(sum2)
  if not denominator:
     return 0.0
  else:
     return round(float(numerator) / denominator, 3)
dataSet1 = [3, 45, 7, 2]
dataSet2 = [2, 54, 13, 15]
get_cosine(dataSet1, dataSet2)

0人赞添加讨论(0) 举报

Cosine Similarity between 2 Number Lists

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间