I've got a MongoDB collection with about 1 million documents in it. Each document has a string that represents a 256-bit sequence of 1s and 0s, like:
0110101010101010110101010101
Ideally, I'd like to query for near binary matches, i.e., pairs of documents whose strings differ in only a small number of bit positions. Yes, this is Hamming distance.
This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.
So, given this, I am trying to find a way to avoid having to do individual Hamming distance comparisons between every pair of documents; doing that naively makes the time required basically impossible.
I have a LOT of RAM. And, in Ruby, there seems to be a great gem (algorithms) that can create a number of tree structures that might reduce the number of comparisons I'd need to make, although I can't seem to make any of them work (yet).
Ideally, I'd like to make 1 million queries, find the near-duplicate strings, and be able to update them to reflect that.
Anyone's thoughts would be appreciated.
This sounds like an algorithmic problem of some sort. You could try comparing those with a similar number of 1 or 0 bits first, then work down through the list from there. Those that are identical will, of course, come out on top. I don't think having tons of RAM will help here.
You could also try to work with smaller chunks. Instead of dealing with 256-bit sequences, could you treat them as 32 8-bit sequences? Or 16 16-bit sequences? At that point you can compute differences with a lookup table and use that as a sort of index.
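For what it's worth, a minimal Ruby sketch of the byte-chunk lookup-table idea might look like this (it assumes the value is stored as a 256-character string of 0s and 1s; `POPCOUNT`, `pack_bits`, and `hamming` are made-up names for illustration):

```ruby
# A 256-entry table: the popcount of every possible byte value.
POPCOUNT = (0..255).map { |b| b.to_s(2).count("1") }

# Pack a 256-character "0"/"1" string into 32 bytes.
def pack_bits(bits)
  [bits].pack("B*").bytes
end

# Hamming distance over 8-bit chunks: XOR corresponding bytes,
# then sum their popcounts from the table instead of comparing
# bit by bit.
def hamming(a_bytes, b_bytes)
  a_bytes.zip(b_bytes).sum { |x, y| POPCOUNT[x ^ y] }
end

a = pack_bits("0110" * 64)
b = pack_bits("0111" * 64)
hamming(a, b) # => 64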
Depending on how "different" you care to match on, you could just permute changes on the source binary value and do a keyed search to find the others that match.
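As a sketch of that (the `bits` field name and the helper are made up; this enumerates only distance-1 neighbors):

```ruby
# Enumerate every string within Hamming distance 1 of the source,
# then look each one up with an ordinary exact-match query.
def neighbors_within_one(bits)
  (0...bits.length).map do |i|
    flipped = bits.dup
    flipped[i] = (flipped[i] == "0" ? "1" : "0")
    flipped
  end
end

# neighbors_within_one(source).each do |candidate|
#   collection.find("bits" => candidate) # "bits" is a made-up field name
# end
```

Note that the number of permutations grows combinatorially: there are 256 strings at distance 1 but C(256, 2) = 32,640 at distance 2, so this only pays off for very small distances.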
I ended up retrieving all of the documents into memory (a subset with just the id and the string).
Then, I used a BK Tree to compare the strings.
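In outline, a BK tree keyed on Hamming distance looks something like this (a minimal sketch, not my exact code; the class and method names are made up):

```ruby
# Minimal BK tree over Hamming distance. Each node keeps a hash of
# children keyed by the distance from that child to the node's value.
class BKTree
  Node = Struct.new(:value, :children)

  def initialize
    @root = nil
  end

  def insert(value)
    return @root = Node.new(value, {}) unless @root
    node = @root
    loop do
      d = hamming(value, node.value)
      return if d == 0 # already stored
      child = node.children[d]
      return node.children[d] = Node.new(value, {}) unless child
      node = child
    end
  end

  # All stored values within `tolerance` of `query`. The triangle
  # inequality lets us skip subtrees keyed outside [d - tol, d + tol].
  def search(query, tolerance, node = @root, found = [])
    return found unless node
    d = hamming(query, node.value)
    found << node.value if d <= tolerance
    ((d - tolerance)..(d + tolerance)).each do |k|
      child = node.children[k]
      search(query, tolerance, child, found) if child
    end
    found
  end

  private

  def hamming(a, b)
    a.chars.zip(b.chars).count { |x, y| x != y }
  end
end
```

Building the tree takes n inserts; each near-duplicate search then only descends into the subtrees the triangle inequality cannot rule out, which is far fewer than n comparisons when the tolerance is small.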
The Hamming distance defines a metric space, so you could use the O(n log n) closest-pair-of-points algorithm, which follows the typical divide-and-conquer approach.
You can then apply this repeatedly until you have "enough" pairs.
Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.
Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than n. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level n in any branch, then give up (in other words, never enter n + 1). If you are using a metric where splitting on one dimension doesn't always yield a distance of 1, you need to adjust the level of recursion where you give up.

As far as I could understand, you have an input string X and you want to query the database for a document containing a string field b such that the Hamming distance between X and document.b is less than some small number d.
You can do this in linear time, just by scanning all of your N = 1M documents and calculating the distance (which takes a small, fixed amount of time per document). Since you only want documents with distance smaller than d, you can give up a comparison after d unmatched characters; you only need to compare all 256 characters if most of them match.
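The early-exit comparison could look like this in Ruby (a sketch, assuming b is stored as a plain string of 0s and 1s; `within_distance?` is just an illustrative name):

```ruby
# Give up on a comparison as soon as d characters have failed to
# match: at that point the distance is already >= d.
def within_distance?(a, b, d)
  mismatches = 0
  a.length.times do |i|
    if a[i] != b[i]
      mismatches += 1
      return false if mismatches >= d
    end
  end
  true
end

# docs.each { |doc| ... if within_distance?(x, doc["b"], d) }  # linear scan
```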
You can try to scan fewer than N documents, that is, to get better than linear time.
Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Then you can query only those documents where the number of ones is close enough to ones(X): specifically, ones(X) - d <= document.ones_count <= ones(X) + d. The Mongo index should kick in here.

If you want to find all close-enough pairs in the set, see @Philippe's answer.
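A sketch of that prefilter with the Ruby Mongo driver (collection, x, d, and the field names b/ones_count are the assumptions carried over from the text; `within_distance?` is the early-exit helper sketched above):

```ruby
# Index ones_count, then use a $gte/$lte range query to cut down
# the candidate set before doing the exact Hamming check in Ruby.
ones_x = x.count("1")

candidates = collection.find(
  "ones_count" => { "$gte" => ones_x - d, "$lte" => ones_x + d }
)

matches = candidates.select { |doc| within_distance?(x, doc["b"], d) }
```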