ALGORITHM - String similarity score/hash (answers, page 2)

Is there a method to calculate something like a general "similarity score" of a string? That is, I am not comparing two strings with each other; instead I get a number/score (hash) for each string that can later tell me whether two strings are similar or not. Two similar strings should have similar (close) scores/hashes.

Let's consider these strings and scores as an example:

Hello world 1000

Hello world! 1010

Hello earth 1125

Foo bar 3250

FooBarbar 3750

Foo Bar! 3300

Foo world! 2350

You can see that Hello world! and Hello world are similar and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the differences.
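To make the intended lookup concrete, here is a minimal sketch of that subtract-and-sort step. The scores are the illustrative numbers from the list above, not the output of any real scoring function, and `most_similar` is a hypothetical helper name.

```python
# Illustrative scores copied from the example above; no real scoring
# function produced them.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}


def most_similar(query, scores, k=3):
    """Return the k strings whose score is closest to the query's score."""
    target = scores[query]
    others = [s for s in scores if s != query]
    # Sort by absolute score difference, smallest first.
    return sorted(others, key=lambda s: abs(scores[s] - target))[:k]


print(most_similar("Hello world", scores))
# ['Hello world!', 'Hello earth', 'Foo world!']
```

Note that this only works if a one-dimensional score really preserves similarity, which is the hard part of the question.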

My end goal: there will be streaming log messages (only the pure message text), and I want to find the pattern of those messages (some sort of regular expression). But that can only start once I can bucket similar strings. To stress it again: I need a number/score (hash) for each string that can later tell me whether two strings are similar or not.

Tags: python string algorithm cluster-analysis hash

8 answers

女痞

#2 · 2019-03-11 09:30

You might want to look at using a BK-Tree. Here is a discussion and Python implementation.

A BK-Tree stores strings in a tree, sorted by Levenshtein distance to the parent nodes. This is normally used to prune the search space when looking for similar strings, but it seems that this tree would form a natural ordering that could be used to create clusters.
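For readers who want to see the structure, here is a minimal sketch of a BK-Tree over Levenshtein distance. It is not the linked implementation; the class and function names (`BKTree`, `levenshtein`, `search`) are chosen here for illustration only.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, children), where children maps distance -> node."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of word."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_dist:
                results.append((d, candidate))
            # Triangle inequality: a match can only sit under an edge whose
            # label lies in [d - max_dist, d + max_dist].
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for s in ["Hello world", "Hello world!", "Hello earth",
          "Foo bar", "FooBarbar", "Foo Bar!", "Foo world!"]:
    tree.add(s)

print(tree.search("Hello world", 2))
# [(0, 'Hello world'), (1, 'Hello world!')]
```

Running `search` with a small radius for each incoming message is one way to assign it to an existing cluster, though this still compares strings rather than producing a standalone score.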


迷人小祖宗

#3 · 2019-03-11 09:32

Have a look at locality-sensitive hashing.

The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).

There's a very good explanation available here together with some sample code.
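Independent of the linked sample code, here is a minimal sketch of one common LSH scheme: MinHash over character 3-grams with banding. The choice of MinHash, the shingle size, the SHA-1 based hash family, and all function names are assumptions made for this illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of lowercase character k-grams (the whole string if shorter)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_hashes=32):
    """One minimum per seeded hash function; similar sets share many minima."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{g}".encode("utf-8")).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_similarity(a, b, num_hashes=32):
    """Fraction of matching signature positions, an estimate of Jaccard."""
    sa = minhash_signature(a, num_hashes)
    sb = minhash_signature(b, num_hashes)
    return sum(x == y for x, y in zip(sa, sb)) / num_hashes


def lsh_buckets(strings, num_hashes=32, bands=16):
    """Split each signature into bands; each band is hashed to a bucket key."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for s in strings:
        sig = minhash_signature(s, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(s)
    return buckets


def candidate_pairs(strings, num_hashes=32, bands=16):
    """Pairs that share at least one bucket, i.e. likely-similar strings."""
    pairs = set()
    for members in lsh_buckets(strings, num_hashes, bands).values():
        ordered = sorted(members)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


messages = ["Hello world", "Hello world!", "Hello earth",
            "Foo bar", "FooBarbar", "Foo Bar!", "Foo world!"]
for pair in sorted(candidate_pairs(messages)):
    print(pair, round(estimated_similarity(*pair), 2))
```

The bucket keys play the role of the hash asked for in the question: each string gets several of them, and two strings count as candidates for similarity when at least one key matches. More bands with fewer rows make the scheme more permissive, fewer bands with more rows make it stricter.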

