I have a process, similar to TinEye, that generates perceptual hashes; these are 32-bit ints.
I intend to store these in a SQL database (maybe a NoSQL DB) in the future.
However, I'm stumped as to how I would retrieve records based on the similarity of hashes.
Any ideas?
A common approach (at least common to me) is to divide your hash bit string into several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You can then perform a bitwise Hamming distance computation on the returned results, which should be only a small subset of your overall dataset. This can be done using data files or SQL tables.
So in simple terms: say you have a bunch of 32-bit hashes in a DB and you want to find every hash within a 3-bit Hamming distance of your "query" hash:

1. Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4.
2. Slice your query hash the same way into qslice 1 to 4.
3. Query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. By the pigeonhole principle, any hash within 3 bits (4 slices − 1) of the query must agree with it exactly on at least one slice, so this returns every such DB hash, plus false positives: that's the "pre-filter" part.
4. For each returned hash, compute the exact Hamming distance pair-wise with your query hash (reconstructing the index-side hash from the four slices) and keep only those within the threshold.
The number of operations in step 4 should be much smaller than a full pair-wise Hamming computation against your whole table.
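A minimal runnable sketch of steps 1 through 4, assuming Python with the built-in sqlite3 module and hypothetical table/column names (any SQL store works the same way):

```python
import sqlite3

def slices(h):
    """Split a 32-bit hash into four 8-bit slices (most significant first)."""
    return tuple((h >> shift) & 0xFF for shift in (24, 16, 8, 0))

def hamming(a, b):
    """Exact Hamming distance: the number of set bits in the XOR."""
    return bin(a ^ b).count("1")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hashes (hash INTEGER, islice1 INTEGER, "
           "islice2 INTEGER, islice3 INTEGER, islice4 INTEGER)")
for col in range(1, 5):  # index each slice column so the pre-filter is cheap
    db.execute(f"CREATE INDEX idx_islice{col} ON hashes(islice{col})")

# Step 1: store each hash together with its four slices.
for h in (0xDEADBEEF, 0xDEADBEEC, 0x12345678):
    db.execute("INSERT INTO hashes VALUES (?, ?, ?, ?, ?)", (h, *slices(h)))

# Steps 2-3: slice the query hash and pre-filter on exact slice matches.
query = 0xDEADBEEF
candidates = db.execute(
    "SELECT hash FROM hashes "
    "WHERE islice1=? OR islice2=? OR islice3=? OR islice4=?",
    slices(query)).fetchall()

# Step 4: exact pair-wise Hamming distance on the (small) candidate set.
matches = [h for (h,) in candidates if hamming(query, h) <= 3]
print(matches)  # 0xDEADBEEF and 0xDEADBEEC survive; 0x12345678 is filtered out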
This approach was first described, afaik, by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent.
Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms".
This is also explained in the paper "Detecting Near-Duplicates for Web Crawling" by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma.
PS: Most of these fine brains were associated with Google at some level or at some time when doing this work, FWIW.
David's discussion is correct, but if you don't have a lot of data, check out "Hamming distance on binary strings in SQL".
To find the Hamming distance, XOR the two integers and count the set bits of the result; bitwise operations make this cheap to compute.
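For example, in Python (any language with bitwise operators works the same way):

```python
def hamming_distance(a, b):
    """Bits that differ between a and b are exactly the set bits of a ^ b."""
    return bin(a ^ b).count("1")

assert hamming_distance(0b1011, 0b0010) == 2  # bits 0 and 3 differ
```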
SQL isn't made for this sort of processing. The comparisons on large data sets get very messy, and will not have the speed of a query that plays to the strengths of the system. That said, I've done similar things.
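A sketch of what the brute-force version can look like (hypothetical names; stock SQLite has no XOR operator or bit-count, so this registers a user-defined Hamming function, while MySQL users could write BIT_COUNT(a ^ b) directly in SQL):

```python
import sqlite3

def hamming(a, b):
    # Number of differing bits = set bits of the XOR.
    return bin(a ^ b).count("1")

db = sqlite3.connect(":memory:")
db.create_function("hamming", 2, hamming)  # expose the function to SQL

db.execute("CREATE TABLE hashes (hash INTEGER)")
db.executemany("INSERT INTO hashes VALUES (?)",
               [(0xDEADBEEF,), (0xDEADBEEC,), (0x12345678,)])

# Full-table scan: compute the distance to every row, then sort by it.
query = 0xDEADBEEF
rows = db.execute(
    "SELECT hash, hamming(hash, ?) AS dist FROM hashes ORDER BY dist",
    (query,)).fetchall()
```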
This will give you individual differences, but it has to run over the full data set and sort it, which is messy at best. If you want it to run faster, you will need strategies like indexing by "region," or finding natural groupings in your data. There are umbrella clustering strategies and the like; there is a lot of literature. It will, however, be messy in most traditional database systems.