I have a list of ~1 million unique 16-character strings (an array called VEC) and I want to calculate the minimum pairwise Hamming distance for each one in Python, storing the results in an array called RES. Basically, I'm calculating the full pairwise distance matrix one row at a time, but only storing the minimum value of each row in RES.
VEC = ['AAAAAAAAAAAAAAAA', 'AAAAAAAAAAAAAAAT', 'AAAAGAAAAAATAAAA', ...]
so that dist(VEC[0], VEC[1]) = 1, dist(VEC[0], VEC[2]) = 2, etc., and RES[0] = 1. Using tips and tricks from these pages, I came up with:
# METHOD #1:
import Levenshtein
import numpy

RES = 99 * numpy.ones(len(VEC))
i = 0
for a in VEC:
    dist = numpy.array([Levenshtein.hamming(a, b) for b in VEC])  # array of distances to every string
    RES[i] = numpy.amin(dist[dist > 0])  # pick the minimum distance greater than zero
    i += 1
A shortened VEC of only 10,000 strings took about 70 sec; extrapolating to the full million, that works out to roughly 8 days. My approach seems wasteful, since I'm recalculating the symmetric half of the distance matrix, so I tried computing only half of the matrix while updating RES for each row as I went along:
# METHOD #2:
import Levenshtein
import numpy

RES = 99 * numpy.ones(len(VEC))
for i in range(len(VEC) - 1):
    dist = [Levenshtein.hamming(VEC[i], VEC[j]) for j in range(i + 1, len(VEC))]
    RES[i] = min(numpy.amin(dist), RES[i])
    # update RES as you go along:
    k = 0
    for j in range(i + 1, len(VEC)):
        if dist[k] < RES[j]:
            RES[j] = dist[k]
        k += 1
Probably not surprisingly, this second approach takes almost twice as long (117 sec), presumably because the extra Python-level bookkeeping outweighs the halved number of distance computations, so it isn't very good. Regardless, can anyone recommend improvements or changes to make this faster?
If you only need the nearest neighbor for each input (ignoring itself), and you can get away with a tiny chance of only finding an approximate nearest neighbor, you might consider implementing the "Bit Sampling" Locality Sensitive Hash for Hamming distance. In a nutshell: create three hash tables. From each 128-bit input (each 16-character string viewed as 16 bytes), sample 16 bits, three times, and use each 16-bit sample as a key into one of the tables. The value stored under each key should be the list of all inputs that produced that sampled key. Once you have placed all million of your inputs into the LSH index, simply:
- Iterate over the million points
- For each input, perform the above 3-sampling
- Find the nearest neighbor in each of the three lists (with distance > 0), keep the best overall
Both loading and querying are ludicrously quick. I might recommend the excellent bitarray library for underpinning this; a rough sketch of the indexing idea follows below.
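Here is a minimal sketch of that idea, simplified to sample a few whole character positions from the 16-character strings rather than individual bits of a 128-bit representation; the helper names, the number of sampled positions, and the table count are my own illustrative choices, not part of the answer above.

import random
import Levenshtein

def build_lsh_index(vec, n_tables=3, n_positions=4, seed=0):
    """Build n_tables hash tables, each keyed on a random subset of character positions."""
    rng = random.Random(seed)
    samplings = [sorted(rng.sample(range(16), n_positions)) for _ in range(n_tables)]
    tables = [{} for _ in range(n_tables)]
    for idx, s in enumerate(vec):
        for positions, table in zip(samplings, tables):
            key = ''.join(s[p] for p in positions)
            table.setdefault(key, []).append(idx)  # bucket of all strings sharing this sample
    return samplings, tables

def approx_min_distance(i, vec, samplings, tables):
    """Approximate nearest-neighbour Hamming distance for vec[i], checking only bucket-mates."""
    best = 16  # 16 characters, so 16 is the largest possible Hamming distance
    for positions, table in zip(samplings, tables):
        key = ''.join(vec[i][p] for p in positions)
        for j in table.get(key, []):
            if j != i:  # skip the string itself
                best = min(best, Levenshtein.hamming(vec[i], vec[j]))
    return best

With the index built once, RES could then be filled with something like RES = [approx_min_distance(i, VEC, samplings, tables) for i in range(len(VEC))]. A string whose true nearest neighbour never lands in the same bucket in any of the tables will get an overestimate, which is the small approximation risk mentioned above.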
I tried to use numpy. Here is my code:
#!/usr/bin/env python
import numpy as np
import time

def gen_data(n):
    # random stand-in data: n rows of 16 character codes in the range 'A'..'Z'
    arr = np.empty(shape=(n, 16))
    for i in range(n):
        arr[i] = np.random.randint(ord('A'), ord('Z') + 1, 16)
    return arr

def distance_from_array(i, arr):
    r = arr[i] != arr  # boolean mismatch matrix: one row of mismatches per string
    r[i, :] = True     # force the self-row to the maximum count so it cannot win the argmin
    min_i = np.argmin(np.sum(r, axis=1))  # index of the closest other string
    return min_i

data = gen_data(1000000)
distances = []
start = time.time()
for i in range(200):
    distances.append(distance_from_array(i, data))
end = time.time()
print(end - start)
You can convert your list of strings into an array of numbers and then use numpy functions that operate on whole arrays, such as sum and argmin. Note that, unlike your dist > 0 filter, this only masks out the self-comparison; that is fine as long as the strings really are unique, but if a string could appear twice its duplicate would show up at distance zero. One possible way to build the numeric array from VEC is sketched below.
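For example, something along these lines should work (this strings_to_array helper is just an illustration, not part of the code above):

import numpy as np

def strings_to_array(vec):
    # Turn a list of equal-length ASCII strings into an (n, 16) uint8 array.
    flat = ''.join(vec).encode('ascii')  # concatenate all strings into one byte string
    return np.frombuffer(flat, dtype=np.uint8).reshape(len(vec), 16)

After that, data = strings_to_array(VEC) can be used in place of gen_data.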
I tested it on my computer and it takes about 10 seconds to process 200 strings. For each one you have to compare against all 1,000,000 other strings, so the full run would take roughly 1,000,000 / 200 * 10 s = 50,000 s, i.e. around 14 hours. However, don't forget that this uses only one core. If you split the indexes across processes and use http://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool, you can get your results considerably faster; a possible wiring is sketched below.
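Here is a rough sketch of how the Pool might be wired up, reusing gen_data and distance_from_array from the code above; the worker-initializer approach, the helper names, and the chunk size are my own assumptions, not part of the answer.

import multiprocessing

_data = None  # set once per worker process by the initializer below

def _init_worker(arr):
    global _data
    _data = arr

def _nearest_index(i):
    # distance_from_array is the function defined in the code above
    return distance_from_array(i, _data)

if __name__ == '__main__':
    arr = gen_data(1000000)
    # the array is shipped to each worker once via the initializer
    # rather than being pickled along with every task
    pool = multiprocessing.Pool(initializer=_init_worker, initargs=(arr,))
    indices = pool.map(_nearest_index, range(len(arr)), chunksize=1000)
    pool.close()
    pool.join()

The if __name__ == '__main__' guard matters on platforms that spawn rather than fork worker processes, since each worker re-imports the module.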