Is there an efficient algorithm for fuzzy deduplication of a list of strings?

Posted 2019-02-19 11:20

For example, I have a long list of strings, each string has about 30-50 characters, and I want to remove strings that are similar to some other string in that list (leaving only one occurrence from a family of duplicates).

I looked at various string similarity algorithms, for example, Levenshtein distance and the method presented in this article. They do work, but it's painfully slow - the best algorithm I came up with has O(n^2) complexity and takes ~1.5 s to process a list of 3000 strings.

Is there some fast way to deduplicate those lists?

2 Answers
Answer 1 · 2019-02-19 11:37

This problem occurs frequently when matching DNA strings (or re-assembling fragments). A first approach is to split the strings into k-mers: substrings of, say, 4 adjacent letters. So

abcdefgh

Would become:

abcd + bcde + cdef + defg + efgh

For the complete dictionary, these substrings can be entered into a hashtable, each carrying as a payload a list of the original strings (their ids) that contain them (and possibly the offset at which they occur).

To search, treat the string under test the same way as the dictionary entries, and look its fragments up in the hashtable. An exact hit will result in all five fragments being found, with the correct offsets. A partial hit will yield fewer than five fragments, but still with the correct offsets.

Of course the search will also produce a lot of false-positive hits, but by intersecting (logical AND) the inverted-index lists, and by only keeping the hits at about the right offset, things get unique pretty fast.

For the problem size in the OP's question, the running time would probably be a few (tens of) milliseconds.
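Here is a rough sketch of this idea in Python (my own illustration, not code from the answer). The k-mer length, the vote threshold, and the tolerated offset drift are assumptions you would tune for your data; the routine keeps a string only when no previously kept string shares enough k-mers at roughly consistent offsets.

```python
# Sketch: k-mer inverted index for fuzzy deduplication.
# Assumed parameters (not from the original post): K, MIN_SHARED, MAX_OFFSET_DRIFT.
from collections import defaultdict

K = 4
MIN_SHARED = 5          # how many co-occurring k-mers count as "similar"
MAX_OFFSET_DRIFT = 3    # tolerated shift between matching k-mer positions

def kmers(s, k=K):
    """Yield (offset, k-mer) pairs for string s."""
    for i in range(len(s) - k + 1):
        yield i, s[i:i + k]

def deduplicate(strings):
    index = defaultdict(list)   # k-mer -> list of (string_id, offset)
    kept = []
    for s in strings:
        # Count, per already-kept string, how many k-mers match at a
        # roughly consistent offset.
        votes = defaultdict(int)
        for off, km in kmers(s):
            for sid, other_off in index.get(km, ()):
                if abs(off - other_off) <= MAX_OFFSET_DRIFT:
                    votes[sid] += 1
        if any(v >= MIN_SHARED for v in votes.values()):
            continue            # similar to something we already kept
        sid = len(kept)
        kept.append(s)
        for off, km in kmers(s):
            index[km].append((sid, off))
    return kept

# Example: the second string differs by one substitution and is dropped.
print(deduplicate(["abcdefghijkl", "abcdefgXijkl", "zzzzzzzzzzzz"]))
```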

BTW: As a side effect of this method, substitutions behave almost the same as indels. In the example, a single substitution would spoil between one (at the ends) and four (in the middle) k-mer matches. For longer strings this is not a problem; for short strings (like the one in the example) it is, and you could use smaller fragments.

Update: I just read the link, and it appears they use 2-mers, too (and throw some statistics at it)

chillily
Answer 2 · 2019-02-19 11:58

If your measure of similarity is strong (e.g. Levenshtein distance 1), then you can process your string list in order, generating all possible "close" strings to the current string and looking up that close string in your hashtable. If it is there, skip the original string. If not, output it and add it to the hashtable.

This algorithm depends on being able to generate all strings close to a given string, and on there not being too many of them. (This is what I mean by "strong" above.)
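A minimal sketch of this for Levenshtein distance 1, in Python (my own illustration). The alphabet used to generate substitutions and insertions is an assumption; use whatever characters actually occur in your data.

```python
# Sketch: deduplicate by generating every string within Levenshtein
# distance 1 of the current string and checking a hashtable of kept strings.
import string

ALPHABET = string.ascii_lowercase  # assumption: adjust to your data

def neighbors(s):
    """All strings at Levenshtein distance exactly 1 from s."""
    out = set()
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])              # deletions
        for c in ALPHABET:
            out.add(s[:i] + c + s[i + 1:])      # substitutions
    for i in range(len(s) + 1):
        for c in ALPHABET:
            out.add(s[:i] + c + s[i:])          # insertions
    out.discard(s)
    return out

def deduplicate(strings):
    seen = set()
    kept = []
    for s in strings:
        if s in seen or any(n in seen for n in neighbors(s)):
            continue                            # a close string was already kept
        kept.append(s)
        seen.add(s)
    return kept

print(deduplicate(["hello", "hallo", "world"]))  # -> ['hello', 'world']
```

For 30-50 character strings over a modest alphabet, each string has a few thousand distance-1 neighbors, so each lookup stays cheap and the whole pass is roughly linear in the number of strings.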

As a possible optimization, you could store more than just the original strings in the hashtable. For instance, if you wanted Levenshtein distance 3, you could store all strings at distance 1 from your output strings in the hashtable, and then, when checking a new string, look up all strings within distance 2 of it.
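To illustrate the same meet-in-the-middle idea with less code, here is a hedged sketch for distance <= 2 rather than the distance-3 case described above: store each kept string together with all of its distance-1 neighbors, so a new string is a duplicate as soon as it, or any of its distance-1 neighbors, is already in the table (1 + 1 <= 2 by the triangle inequality). It assumes the neighbors() helper from the previous sketch.

```python
# Sketch: deduplicate within Levenshtein distance 2 by pre-expanding the
# table with distance-1 neighbors of every kept string.
# neighbors() is the helper defined in the previous sketch.

def deduplicate_within_2(strings):
    table = set()
    kept = []
    for s in strings:
        candidates = {s} | neighbors(s)
        if not candidates.isdisjoint(table):
            continue                      # within distance 2 of a kept string
        kept.append(s)
        table.add(s)
        table.update(neighbors(s))        # pre-expand for future lookups
    return kept

# "sittin" is two edits from "kitten", so it is dropped as well.
print(deduplicate_within_2(["kitten", "sitten", "sittin", "banana"]))
```

The trade-off is memory: every kept string adds a few thousand table entries, which is fine for a few thousand strings but worth keeping in mind for larger lists.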
