How Locality Sensitive Hashing (LSH) works?

2019-05-25 08:56发布

问题:

I've read already this question, but unfortunately it didn't help.

What I don't understand is what we do once we understood which bucket assign to our high-dimensional space query vector q: suppose that using our set of locality sensitive family functions h_1,h_2,...,h_n we have translated q to a low-dimension (n dimensions) hash code c.

Then c is the index of the bucket which q is assigned to and where (hopefully) are assigned also its nearest neighbors, let say that there are 100 vectors.

Now, what we do in order to find the q's NN is to compute the distance between q and only these 100 vectors, is that correct? So the use of c is only for indexing (it's used only for deciding which bucket assign q to`), right?


Another solution, as described in this survey (section 2.2), an alternative to "hash table lookup" (the previously described approach) is "fast distance approximation", so we perform an exhaustive research where we compute the distance between the c and the generated hash code relative to each element in the dataset. This is supposed to be fast since the hash code are in a low-dimension space AND the distance is supposed to be faster to compute (for example if the hash code space is binary, then we can use the XOR operator for quickly computing the hamming distance between two hash codes).

Now, what I wonder is: what are the advantages/disadvantages of the two methods? Why should I use one approach instead of the other one?

回答1:

The first described method explains an approximate nearest neighbors search. Yes you'd get the best performance by just checking those 100 other items in the bin c, but you've got a higher risk on missing good candidates in other neighboring buckets.

A simple hashing scheme for lat/lon coordinates is the Geohash. You could find nearest shop by looking at items within the same Geohash block, but inaccurate results are possible near grid boundaries.

One example of fast distance approximation can be found here, it finds image matches with sufficiently small Hamming distance (utilizing pHash). As hashes are only 64-bits long a laptop GPU could do 700 million comparisons / second. Note that all images are checked, no indexing or hashing data structure is used. This way you age guaranteed to get all matches (in pHash space).