ALGORITHM - String similarity score/hash

Posted 2019-03-11 08:54

Is there a method to calculate a general "similarity score" for a string? I do not want to compare two strings directly; instead I want a number/score (hash) for each string that can later tell me whether two strings are similar. Two similar strings should have similar (close) scores/hashes.

Let's consider these strings and scores as an example:

Hello world 1000

Hello world! 1010

Hello earth 1125

Foo bar 3250

FooBarbar 3750

Foo Bar! 3300

Foo world! 2350

You can see that Hello world! and Hello world are similar and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the differences.
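To make the lookup described above concrete, here is a minimal sketch. The scores are copied from the example table in this question (they are illustrative, not computed by any real algorithm), and the function name `most_similar` is hypothetical.

```python
# Scores copied from the question's example; they are illustrative only.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}


def most_similar(query, scores, k=3):
    """Return the k strings whose score is closest to the query's score."""
    target = scores[query]
    others = [s for s in scores if s != query]
    # Sort by absolute score difference, smallest first.
    return sorted(others, key=lambda s: abs(scores[s] - target))[:k]


nearest = most_similar("Hello world", scores)
print(nearest)
```

With these example scores, "Hello world!" (difference 10) comes first, then "Hello earth" (125), then "Foo world!" (1350).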

My end goal: I have a stream of log messages (only the message text), and I want to find the patterns in those messages (something like a regular expression). That can only start once I can bucket similar strings. To stress it again: I need a number/score (hash) for each string that can later tell me whether two strings are similar.

8 answers
女痞
#2 · 2019-03-11 09:30

You might want to look at using a BK-Tree. Here is a discussion and Python implementation.

A BK-Tree stores strings in a tree, sorted by Levenshtein distance to the parent nodes. This is normally used to prune the search space when looking for similar strings, but it seems that this tree would form a natural ordering that could be used to create clusters.
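As a rough illustration of the idea (not the implementation linked above), here is a minimal BK-Tree sketch. The class and method names (`BKTree`, `add`, `search`) are my own choices, and the Levenshtein function is a plain dynamic-programming version.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None  # each node is (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, stored_word) pairs within max_dist."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node_word, children = stack.pop()
            d = self.distance(word, node_word)
            if d <= max_dist:
                results.append((d, node_word))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_dist, d + max_dist] can contain matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for s in ["Hello world", "Hello world!", "Hello earth", "Foo bar",
          "FooBarbar", "Foo Bar!", "Foo world!"]:
    tree.add(s)

matches = tree.search("Hello world", 2)
print(matches)
```

Note that this still compares strings pairwise during the search; the tree only reduces how many comparisons are needed. It does not give each string a standalone score.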

迷人小祖宗
#3 · 2019-03-11 09:32

Have a look at locality-sensitive hashing.

The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).

There's a very good explanation available here together with some sample code.
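For a feel of how this can work on strings, here is a minimal sketch of one common LSH scheme, MinHash with banding. This is my own illustration, not the sample code linked above; the function names and the parameters (`k=3` character shingles, 32 hash functions, 16 bands) are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of overlapping character k-grams (lower-cased)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, gram):
    """Deterministic 64-bit hash of a gram under a given seed."""
    digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=32):
    """One minimum per seeded hash; similar shingle sets share many minima."""
    grams = shingles(text)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(strings, num_hashes=32, bands=16):
    """Group strings that agree on at least one band of their signature."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for s in strings:
        sig = minhash_signature(s, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(s)
    return buckets


sig_hello = minhash_signature("Hello world")
sig_hello_bang = minhash_signature("Hello world!")
sig_foo = minhash_signature("Foo bar")

print(estimated_similarity(sig_hello, sig_hello_bang))
print(estimated_similarity(sig_hello, sig_foo))

# Buckets holding more than one string are candidate groups of similar strings.
buckets = lsh_buckets(["Hello world", "Hello world!", "Foo bar"])
candidates = [group for group in buckets.values() if len(group) > 1]
print(candidates)
```

Each signature is computed from one string alone, which is close to what the question asks for: the band keys act as hashes, and two strings are likely similar if they share at least one. More bands with fewer rows each make the bucketing more permissive; fewer bands with more rows make it stricter.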
