I'm trying to understand section 5 of this paper about LSH, in particular how to bucket the generated hashes. Quoting the linked paper:
Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+ε))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following: For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q. This can be done by maintaining two pointers for each sorted order O_σ (one moves up and the other down). At each step we move one of the pointers up or down corresponding to the element with the longest matching prefix. (Here the length of the longest matching prefix in O_σ is computed relative to q with its bits permuted by σ). We examine 2N = O(n^(1/(1+ε))) bit vectors in this way. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to q.
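If I try to turn just the pre-processing sentence into code, I get something like this (a rough Python sketch of my own reading, with made-up names, so it may well be wrong):

```python
import random

def permute(bits, sigma):
    # reorder the bits of one vector according to the permutation sigma
    return "".join(bits[i] for i in sigma)

def preprocess(dataset, d, N):
    # dataset: list of d-bit strings
    # pick N random bit permutations and, for each one, keep the dataset
    # sorted in lexicographic order of the permuted bits (the paper's O_sigma)
    sigmas = [random.sample(range(d), d) for _ in range(N)]
    orders = [sorted(dataset, key=lambda bits: permute(bits, sigma)) for sigma in sigmas]
    return sigmas, orders
```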
I'm confused by this algorithm and I don't think I've understood how it works.
I already found this question on the topic, but I didn't understand the answer in the comments. Also, in this question the same algorithm is described in point 2, but again I don't understand how it works.
Can you please try to explain to me how it works, step by step, as simply as possible?
I even tried to make a list of things that I don't understand, but in practice it is so badly written that I don't understand most of the sentences!
EDIT after gsamaras' answer:
I mostly understood the answer, but I still have some doubts:
1. Is it correct to say that the total cost of performing the N permutations is O(N n log n), since we have to sort each one of them?
2. Is the permutation + sorting process described above performed only once during the pre-processing, or once for every query q? It already seems pretty expensive, O(N n log n), even as pre-processing; if we have to do this at query time it's a disaster :D
3. At the last point, where we compare v0 and v4 to q, do we compare their permuted versions or the original ones (before their permutation)?
This question is somewhat broad, so I am just going to give a minimal (abstract) example here:
We have 6 (= n) vectors in our dataset, with d bits each. Let's assume that we do 2 (= N) random permutations.
Let the 1st random permutation begin! Remember that we permute the bits, not the order of the vectors. After permuting the bits of every vector, we keep the vectors sorted in lexicographic order of their permuted bits, for example:
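(The concrete values don't matter; this is a made-up illustration with d = 4, using the hypothetical permutation σ1 = (1, 3, 0, 2), meaning permuted bit j is the original bit at position σ1[j]. Each line shows a vector's original bits and its permuted bits, and the list is sorted lexicographically on the permuted bits.)

```
v5: 0001 -> 0100
v1: 0011 -> 0101
v0: 0100 -> 1000
v3: 1100 -> 1010
v4: 1101 -> 1110
v2: 1111 -> 1111
```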
Now the query vector q arrives, but it is (almost certainly) not going to be identical to any vector in our dataset (after the permutation), so the binary search won't find it exactly. However, the binary search does tell us between which two vectors it ends up. So we can imagine the scenario to be like this (for example, q lies between v0 and v3).
Now we move either the up pointer or the down pointer, seeking the vector vi whose permuted bits share the longest matching prefix with the permuted q. Let's say it was v0.
Similarly, we do the second permutation and find a vector vi in the same way, let's say v4. We now compare v0 from the first permutation with v4, to see which one is closest to q, i.e. which one has the most bits equal to q.
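If it helps, here is a minimal Python sketch of the whole query procedure as I read the paper. The vectors, the two permutations and the steps parameter below are made up for illustration, and a real implementation would be more careful:

```python
import bisect

def permute(bits, sigma):
    # reorder a bit string according to the permutation sigma
    return "".join(bits[i] for i in sigma)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def prefix_len(a, b):
    # length of the longest common prefix of two strings
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def build_orders(dataset, sigmas):
    # pre-processing, done once: for each permutation, the (permuted bits, name)
    # pairs sorted lexicographically -- these play the role of the paper's O_sigma
    return [sorted((permute(bits, s), name) for name, bits in dataset.items())
            for s in sigmas]

def query(q, dataset, sigmas, orders, steps):
    candidates = set()
    for sigma, order in zip(sigmas, orders):
        qp = permute(q, sigma)                      # q permuted with the same sigma
        keys = [key for key, _ in order]
        pos = bisect.bisect_left(keys, qp)          # binary search for q's position
        up, down = pos - 1, pos                     # the two pointers
        for _ in range(steps):
            up_score = prefix_len(keys[up], qp) if up >= 0 else -1
            down_score = prefix_len(keys[down], qp) if down < len(keys) else -1
            if up_score < 0 and down_score < 0:
                break
            if up_score >= down_score:              # move the pointer with the longer matching prefix
                candidates.add(order[up][1]); up -= 1
            else:
                candidates.add(order[down][1]); down += 1
    # of everything examined, return the vector with the smallest Hamming distance to q
    return min(candidates, key=lambda name: hamming(dataset[name], q))

dataset = {"v0": "0100", "v1": "0011", "v2": "1111",
           "v3": "1100", "v4": "1101", "v5": "0001"}
sigmas = [(1, 3, 0, 2), (2, 0, 3, 1)]               # two made-up bit permutations
orders = build_orders(dataset, sigmas)              # built once, before any query
print(query("0110", dataset, sigmas, orders, steps=2))   # -> v0 for these made-up values
```

Note that build_orders runs only once, during pre-processing; only query runs per query vector.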
Edit:
1. If they actually sort every permutation from scratch, then yes, but it's not clear to me how they do it.
2. ONCE (during the pre-processing).
3. I think they do it with the permuted versions (see the parenthetical remark before 2N in the paper). But it wouldn't make any difference anyway, since they permute q too with the same permutation σ, and applying the same permutation to both vectors doesn't change the Hamming distance between them.
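A quick way to convince yourself of that last point (made-up values; any permutation works):

```python
import random

a, q = "0100", "0110"                           # two made-up bit vectors
sigma = random.sample(range(len(a)), len(a))    # a random bit permutation
perm = lambda bits: "".join(bits[i] for i in sigma)
ham = lambda x, y: sum(c != d for c, d in zip(x, y))
# permuting both vectors with the same sigma does not change their Hamming distance
assert ham(perm(a), perm(q)) == ham(a, q)
```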
This Quora answer may shed some light too.