Fuzzy String Comparison

What I am striving to complete is a program which reads in a file and will compare each sentence according to the original sentence. The sentence which is a perfect match to the original will receive a score of 1 and a sentence which is the total opposite will receive a 0. All other fuzzy sentences will receive a grade in between 1 and 0.

I am unsure which operation to use to allow me to complete this in Python 3.

I have included the sample text in which the Text 1 is the original and the other preceding strings are the comparisons.

Text: Sample

Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.

Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines // Should score high point but not 1

Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines // Should score lower than text 20

Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // Should score lower than text 21 but NOT 0

Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // Should score a 0!

标签： python nlp fuzzy-comparison

4条回答

走好不送

2楼-- · 2019-01-10 02:29

fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.

from fuzzyset import FuzzySet
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
    It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
    I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
    It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]

Warning: Be careful not to mix unicode and bytes in your fuzzyset.

0人赞添加讨论(0) 举报

戒情不戒烟

3楼-- · 2019-01-10 02:43

There is a module in the standard library (called difflib) that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you are after.

EDIT: Small example from python prompt:

>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451

HTH!

0人赞添加讨论(0) 举报

Root（大扎）

4楼-- · 2019-01-10 02:46

There is a package called fuzzywuzzy. Install via pip:

pip install fuzzywuzzy

Simple usage:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
    96

The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler, it has a number of different matching methods (like token order insensitivity, partial string matching) which make it more powerful in practice. The process.extract functions are especially useful: find the best matching strings and ratios from a set. From their readme:

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100

Process

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)

0人赞添加讨论(0) 举报

再贱就再见

5楼-- · 2019-01-10 02:55

The task is called Paraphrase Identification which is an active area of research in Natural Language Processing. I have linked several state of the art papers many of which you can find open source code on GitHub for.

Note that all the answered question assume that there is some string/surface similarity between the two sentences while in reality two sentences with little string similarity can be semantically similar.

If you're interested in that kind of similarity you can use Skip-Thoughts. Install the software according to the GitHub guides and go to paraphrase detection section in readme:

import skipthoughts
model = skipthoughts.load_model()
vectors = skipthoughts.encode(model, X_sentences)

This converts your sentences (X_sentences) to vectors. Later you can find the similarity of two vectors by:

similarity = 1 - scipy.spatial.distance.cosine(vectors[0], vectors[1])

where we are assuming vector[0] and vector1 are the corresponding vector to X_sentences[0], X_sentences1 which you wanted to find their scores.

There are other models to convert a sentence to a vector which you can find here.

Once you convert your sentences into vectors the similarity is just a matter of finding the Cosine similarity between those vectors.

0人赞添加讨论(0) 举报

Fuzzy String Comparison

Text: Sample

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间