Optimal way to cluster set of strings with hamming

2020-08-02 05:31发布

I have a database with n strings (n > 1 million), each string has 100 chars, each char is either a, b, c or d.

I would like to find the closest strings for each one, closest defines as having the smallest hamming distance. I would like to find the k-nearest strings for each one (k < 5).

Example

N = 5
i1 = aacbdbbb
i2 = abcbdbbb
i3 = bbcadabd
i4 = bbcadabb

HammingDistance(i1,i2) = 1
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 4
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 3
HammingDistance(i3,i4) = 1

For k=1 it should return {(i1,i2),(i2,i1),(i3,i4),(i4,i3)}. For k=2 it should return { (i1,{i2,i4}),(i2,{i1,i4}),(i3,{i4,i2}),(i4,{i3,i2})}. And so on. For each string should find k-nearest string.

Naive solution has O(n^2) time complexity. I would like to find solution with better complexity. I found some other solutions, but none of them was better than naive.

How can I optimally distribute such strings into clusters? One string could be in both or more clusters. Solution may be deterministic or probabilistic.

0条回答
登录 后发表回答