Find the similarity metric between two strings

Posted 2019-01-01 13:55

Question:

How do I get the probability of a string being similar to another string in Python?

I want a decimal value such as 0.9 (meaning 90%), preferably using only Python and its standard library.

e.g.

similar("Apple", "Appel")  # would have a high probability

similar("Apple", "Mango")  # would have a lower probability

Answer 1:

There is a built-in: SequenceMatcher in the standard library's difflib module.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple", "Appel")
0.8
>>> similar("Apple", "Mango")
0.0


Answer 2:

I think you may be looking for an algorithm that measures the distance between strings. Here are some you can refer to:

  1. Hamming distance
  2. Levenshtein distance
  3. Damerau–Levenshtein distance
  4. Jaro–Winkler distance
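
To make the distance idea concrete, here is a minimal pure-Python sketch of Levenshtein distance, plus one common way to turn it into a 0-to-1 score. The function names and the normalization (dividing by the longer string's length) are my own choices for illustration, not part of any library.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

Note that plain Levenshtein counts a swap of two adjacent letters as two edits, so "Apple" vs "Appel" has distance 2; Damerau–Levenshtein would count it as one.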


Answer 3:

FuzzyWuzzy is a package that implements Levenshtein distance in Python, with helper functions for situations where you want two distinct strings to be considered identical. For example:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100
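
If you cannot install a package, the token-sort idea can be roughly approximated with the standard library: lowercase, split into words, sort the words, then compare. This is only a sketch of the idea. The function name is made up, and it is not FuzzyWuzzy's actual implementation, so scores will not always agree with `fuzz.token_sort_ratio`.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a, b):
    """Compare two strings while ignoring word order and letter case."""
    def normalize(text):
        # Lowercase, split on whitespace, sort the words, and rejoin them.
        return " ".join(sorted(text.lower().split()))

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With this, "fuzzy wuzzy was a bear" and "wuzzy fuzzy was a bear" normalize to the same string and score 1.0.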


Answer 4:

Solution #1: Python builtin

Use SequenceMatcher from difflib.

Pros: part of the Python standard library, so no extra package is needed.
Cons: limited; there are many other good string-similarity algorithms that it does not offer.

Example:
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
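
In case it helps to see where 0.75 comes from: the difflib docs define the ratio as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes it from the matching blocks; `ratio_by_hand` is just an illustrative name.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2.0 * M / T."""
    matcher = SequenceMatcher(None, a, b)
    # Each matching block records a run of identical characters;
    # the final block is a zero-length sentinel and adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0
```

For "abcd" and "bcde" the only matching block is "bcd" (3 characters), so the result is 2 * 3 / 8 = 0.75.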

Solution #2: jellyfish library

It's a very good library with good coverage and few issues. It supports:
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance

Pros: easy to use, wide range of supported algorithms, tested.
Cons: not part of the standard library.

example:

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1


Answer 5:

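You can write a simple position-by-position comparison yourself. It pads the shorter string with spaces and returns the fraction of positions where the characters match, so it does not handle insertions or deletions well:

def similar(w1, w2):
    # Pad the shorter string with spaces so both have the same length.
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    if not w1:
        return 1.0  # assumption: two empty strings count as identical
    # Fraction of positions that hold the same character.
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))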


Answer 6:

The distance package includes Levenshtein distance:

import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3


Answer 7:

The built-in SequenceMatcher is very slow on large inputs. Here is how it can be done with diff-match-patch:

from diff_match_patch import diff_match_patch

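def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 disables the timeout, so the diff runs to completion
    diff = dmp.diff_main(text1, text2, False)

    # Similarity: characters in unchanged segments (op == 0, i.e. DIFF_EQUAL)
    # divided by the length of the longer text.
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    # Assumption: two empty texts count as identical (avoids dividing by zero).
    sim = common_text / text_length if text_length else 1.0

    return sim, diff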