Question:
I have two lists with usernames and I want to calculate the Jaccard similarity. Is it possible?
This thread shows how to calculate the Jaccard Similarity between two strings, however I want to apply this to two lists, where each element is one word (e.g., a username).
Answer 1:
I ended up writing my own solution after all:
def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    # Note: using len(list1) + len(list2) assumes neither list contains duplicates
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union
Answer 2:
@aventinus I don't have enough reputation to add a comment to your answer, but just to make things clearer: your solution measures the jaccard_similarity, but the function is misnamed as jaccard_distance, which is actually 1 - jaccard_similarity.
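To make the distinction concrete, here is a minimal sketch of both measures. The set-based similarity and the name jaccard_distance are used only for illustration; the sketch assumes Python 3 and that at least one of the lists is non-empty.

```python
def jaccard_similarity(list1, list2):
    # Compare the lists as sets of unique elements
    s1, s2 = set(list1), set(list2)
    return len(s1 & s2) / len(s1 | s2)


def jaccard_distance(list1, list2):
    # The distance is the complement of the similarity
    return 1 - jaccard_similarity(list1, list2)


list1 = ['dog', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
print(jaccard_similarity(list1, list2))  # 0.5
print(jaccard_distance(list1, list2))    # 0.5
```

Identical lists give a similarity of 1 and a distance of 0; lists with nothing in common give a similarity of 0 and a distance of 1.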
Answer 3:
For Python 3:
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return len(s1.intersection(s2)) / len(s1.union(s2))

list1 = ['dog', 'cat', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
print(jaccard_similarity(list1, list2))  # 0.5
For Python 2, use return len(s1.intersection(s2)) / float(len(s1.union(s2))) instead, so that the division is not truncated to an integer.
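One edge case the function above does not handle: if both lists are empty, the union is empty and the division raises ZeroDivisionError. A possible guard is sketched below. The name jaccard_similarity_safe is hypothetical, and returning 1.0 for two empty lists is an assumption (a common convention, not something the answer specifies).

```python
def jaccard_similarity_safe(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    union = s1.union(s2)
    if not union:
        # Assumption: two empty inputs are treated as identical
        return 1.0
    return len(s1.intersection(s2)) / len(union)


print(jaccard_similarity_safe([], []))  # 1.0
print(jaccard_similarity_safe(['dog', 'cat', 'cat', 'rat'],
                              ['dog', 'cat', 'mouse']))  # 0.5
```

Depending on the application, returning 0.0 or raising a descriptive error for two empty lists may be a better fit.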
Answer 4:
Assuming your usernames don't repeat, you can use the same idea:
def jaccard(a, b):
    # a and b must be sets
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

list1 = ['dog', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
# The intersection is ['dog', 'cat']
# The union is ['dog', 'cat', 'rat', 'mouse']
words1 = set(list1)
words2 = set(list2)
print(jaccard(words1, words2))  # 0.5
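Answer 5:
You can use the Distance library. Note that the value returned here is the Jaccard distance (1 - similarity): the character sets of "decide" and "resize" share 2 of 7 distinct letters, so the similarity is 2/7 and the distance is 5/7.
# pip install Distance
import distance
distance.jaccard("decide", "resize")
# Returns
# 0.7142857142857143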
Answer 6:
@Aventinus (I also cannot comment): Note that Jaccard similarity is an operation on sets, so the denominator should also use sets (instead of lists). For example, jaccard_similarity('aa', 'ab') should result in 0.5.
def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection
    return intersection / union
Note that in the intersection, there is no need to cast to list first. Also, the cast to float is not needed in Python 3.
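The effect of the denominator can be checked directly on the 'aa' / 'ab' example. The sketch below puts the two variants side by side; the function names jaccard_list_denominator and jaccard_set_denominator are hypothetical and exist only for this comparison (Python 3 assumed).

```python
def jaccard_list_denominator(list1, list2):
    # Denominator built from raw lengths: duplicates are counted
    intersection = len(set(list1).intersection(list2))
    union = len(list1) + len(list2) - intersection
    return intersection / union


def jaccard_set_denominator(list1, list2):
    # Denominator built from set sizes: duplicates are ignored
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection
    return intersection / union


print(jaccard_list_denominator('aa', 'ab'))  # 1 / (2 + 2 - 1) = 0.333...
print(jaccard_set_denominator('aa', 'ab'))   # 1 / (1 + 2 - 1) = 0.5
```

The two variants agree whenever neither input contains duplicates, which is why the difference is easy to miss with lists of unique usernames.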
Answer 7:
If you'd like to include repeated elements, you can use Counter, which I would imagine is relatively quick since it's just an extended dict under the hood:
from collections import Counter

def jaccard_repeats(a, b):
    """Jaccard similarity measure between input lists,
    allowing repeated elements"""
    _a = Counter(a)
    _b = Counter(b)
    # Multiset intersection: keeps the minimum count of each element
    c = _a & _b
    n = sum(c.values())
    # len(a) + len(b) - n is the size of the multiset union
    return n / (len(a) + len(b) - n)

list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']
print(jaccard_repeats(list1, list3))  # 0.4
print(jaccard_repeats(list1, list2))  # 0.75
print(jaccard_repeats(list2, list3))  # 0.5