Fuzzy String Comparison

Posted 2019-05-13 17:35

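What I am trying to write is a program that reads in a file and compares each sentence against an original sentence. A sentence that matches the original perfectly should receive a score of 1, and a sentence that is its total opposite should receive a 0. All other fuzzy sentences should receive a grade between 1 and 0.

I am unsure which operation to use to accomplish this in Python 3.

I have included sample text below, in which Text 1 is the original and the strings that follow it are the ones to compare.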

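Text: Sample

Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.

Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines // Should score high, but not 1

Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines // Should score lower than Text 20

Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // Should score lower than Text 21, but NOT 0

Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // Should score 0!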

Answer 1:

There is a package called fuzzywuzzy. Install it via pip:

pip install fuzzywuzzy

Simple usage:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
    96

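The package is built on top of difflib. Why not just use that directly, you ask? Apart from being a bit simpler to use, it offers several different matching methods (such as token-order insensitivity and partial string matching), which make it more powerful in practice. The process.extract functions are especially useful: they find the best-matching strings, with their ratios, from a set of choices. From their readme: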

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100

Process

>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)
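To illustrate what token-order insensitivity means, here is a minimal sketch that uses only difflib from the standard library. It is not fuzzywuzzy's actual implementation; the function name sorted_token_similarity is made up for this example, and it assumes that splitting on whitespace and lowercasing is an acceptable normalization. It returns a float between 0 and 1 rather than an integer out of 100.

```python
from difflib import SequenceMatcher


def sorted_token_similarity(a: str, b: str) -> float:
    """Compare two strings after lowercasing and sorting their whitespace tokens.

    Hypothetical helper for illustration only; not part of fuzzywuzzy.
    """
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


first = "fuzzy wuzzy was a bear"
second = "wuzzy fuzzy was a bear"

# Plain character-level comparison is sensitive to word order.
plain_score = SequenceMatcher(None, first, second).ratio()

# Sorting the tokens first makes the two strings identical.
sorted_score = sorted_token_similarity(first, second)

print(f"plain:  {plain_score:.3f}")
print(f"sorted: {sorted_score:.3f}")
```

Because both strings contain the same words, sorting the tokens removes the only difference between them, which is why a token-sort style comparison rates them as a perfect match.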


Answer 2:

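There is a module in the standard library (called difflib) that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you are after.

EDIT: A small example from the Python prompt: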

>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451

HTH!
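As a rough sketch of how this could be applied to the question's sample texts, the snippet below scores each candidate against the original and prints them from most to least similar. The variable names and the similarity helper are made up for this example. Keep in mind that SequenceMatcher only measures character-level overlap, so Text 24, which negates the original, will not come out at 0.

```python
from difflib import SequenceMatcher

original = (
    "It was a dark and stormy night. I was all alone sitting on a red chair. "
    "I was not completely alone as I had three cats."
)

# Candidate sentences taken from the question's sample.
candidates = {
    "Text 20": "It was a murky and stormy night. I was all alone sitting on a crimson chair. "
               "I was not completely alone as I had three felines",
    "Text 21": "It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. "
               "I was not completely alone as I had three felines",
    "Text 22": "I was all alone sitting on a crimson cathedra. "
               "I was not completely alone as I had three felines. "
               "It was a murky and tempestuous night.",
    "Text 24": "It was a dark and stormy night. I was not alone. "
               "I was not sitting on a red chair. I had three cats.",
}


def similarity(a: str, b: str) -> float:
    """Return a surface-similarity score in [0, 1] (hypothetical helper)."""
    return SequenceMatcher(None, a, b).ratio()


scores = {label: similarity(original, text) for label, text in candidates.items()}

# Print candidates from most to least similar.
for label, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {score:.3f}")
```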



Answer 3:

fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.

from fuzzyset import FuzzySet
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
    It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
    I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
    It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]

WARNING: Be careful not to mix unicode and bytes in your fuzzyset.
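One way to guard against that, sketched here under the assumption that any bytes input is UTF-8 encoded, is to normalize everything to str before it reaches the set. The to_text helper is hypothetical and not part of fuzzyset.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes if needed (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


# Mixed input, as it might arrive from files opened in different modes.
raw_items = [b"It was a dark and stormy night.", "I had three cats."]

# Normalize once, up front, so only str values are indexed or queried.
clean_items = [to_text(item) for item in raw_items]
print(clean_items)
```

The same helper can be applied to the query string before calling get, so that indexing and lookup always see the same type.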



Answer 4:

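The task is called Paraphrase Identification, which is an active area of research in Natural Language Processing. I have linked several state-of-the-art papers, for many of which you can find open-source code on GitHub.

Note that all the other answers assume there is some string/surface similarity between the two sentences, while in reality two sentences with little string similarity can still be semantically similar.

If you are interested in that kind of similarity, you can use Skip-Thoughts. Install the software according to the GitHub guides and go to the paraphrase detection section of the readme: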

import skipthoughts
model = skipthoughts.load_model()
vectors = skipthoughts.encode(model, X_sentences)

This converts your sentences (X_sentences) to vectors. Later, you can find the similarity of two vectors with:

import scipy.spatial.distance
similarity = 1 - scipy.spatial.distance.cosine(vectors[0], vectors[1])

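Here we assume that vectors[0] and vectors[1] are the vectors corresponding to X_sentences[0] and X_sentences[1], the two sentences whose score you want to find.

There are other models for converting a sentence to a vector, which you can find here.

Once you have converted your sentences into vectors, the similarity is just a matter of finding the cosine similarity between those vectors.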
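For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, which is what 1 minus SciPy's cosine distance gives. Below is a minimal pure-Python sketch of that formula. The function name and the toy vectors are made up for illustration; real sentence vectors would come from a model such as Skip-Thoughts.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings.
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]   # same direction as toy_a
toy_c = [3.0, 0.0, -1.0]  # orthogonal to toy_a (dot product is 0)

print(cosine_similarity(toy_a, toy_b))
print(cosine_similarity(toy_a, toy_c))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, regardless of their magnitudes.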



Source: Fuzzy String Comparison