I have a database with n strings (n > 1 million), each string has 100 chars, each char is either a
, b
, c
or d
.
I would like to find the closest strings for each one, closest defines as having the smallest hamming distance. I would like to find the k-nearest strings for each one (k < 5).
Example
N = 5
i1 = aacbdbbb
i2 = abcbdbbb
i3 = bbcadabd
i4 = bbcadabb
HammingDistance(i1,i2) = 1
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 4
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 3
HammingDistance(i3,i4) = 1
For k=1 it should return {(i1,i2),(i2,i1),(i3,i4),(i4,i3)}. For k=2 it should return { (i1,{i2,i4}),(i2,{i1,i4}),(i3,{i4,i2}),(i4,{i3,i2})}. And so on. For each string should find k-nearest string.
Naive solution has O(n^2) time complexity. I would like to find solution with better complexity. I found some other solutions, but none of them was better than naive.
How can I optimally distribute such strings into clusters? One string could be in both or more clusters. Solution may be deterministic or probabilistic.