I'm trying to implement the Bag of Features model.
Given a descriptor matrix (representing an image) belonging to the initial dataset, computing its histogram is easy, since we already know which cluster each descriptor vector belongs to from k-means.
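For reference, this is the (easy) training-image case as I understand it, sketched with scikit-learn (the function and variable names are mine, not from any particular implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_vocabulary(all_descriptors, k=100):
    """k-means over all training descriptors stacked into one (N, 128) array."""
    return KMeans(n_clusters=k, n_init=10).fit(all_descriptors)

def train_histogram(kmeans, start, end):
    """Histogram of the training image whose descriptors occupy rows
    start:end of the stacked array; no distances are needed, since the
    labels are already known from the k-means fit."""
    labels = kmeans.labels_[start:end]
    return np.bincount(labels, minlength=kmeans.n_clusters)
```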
But what if we want to compute the histogram of a query matrix? The only solution that crosses my mind is to compute the distance between each descriptor vector and each of the k cluster centroids. This can be inefficient: suppose that k=100 (so 100 centroids) and the query image is represented by 1000 SIFT descriptors, i.e. a 1000x128 matrix. We would then have to compute 1000 * 100 = 100,000 Euclidean distances in 128 dimensions. This seems really inefficient.
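To make the brute-force approach concrete, here is a minimal sketch of what I have in mind (NumPy/SciPy; the function name is mine):

```python
import numpy as np
from scipy.spatial.distance import cdist

def query_histogram(descriptors, centroids):
    """Brute-force BoF histogram of a query image.

    descriptors: (n, 128) SIFT descriptors of the query image
    centroids:   (k, 128) cluster centers from k-means
    """
    # The full (n, k) distance matrix -- the n * k Euclidean distances
    # in 128 dimensions that worry me.
    dists = cdist(descriptors, centroids)
    nearest = dists.argmin(axis=1)   # closest centroid per descriptor
    return np.bincount(nearest, minlength=centroids.shape[0])
```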
How can I solve this problem?
NOTE: can you point me to some implementations where this step is explained?
NOTE: I know LSH is a solution (since we are using high-dimensional vectors), but I don't think that actual implementations use it.
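(What I have seen suggested instead of LSH is a tree over the centroids, e.g. SciPy's cKDTree. This is only a sketch, and I am aware that exact KD-tree queries degrade in 128 dimensions, which is presumably why libraries like FLANN use randomized/approximate variants:)

```python
import numpy as np
from scipy.spatial import cKDTree

def query_histogram_kdtree(descriptors, centroids):
    """Same histogram as the brute-force version, but the nearest-centroid
    lookup goes through a KD-tree over the k centroids."""
    tree = cKDTree(centroids)             # in practice, build once and reuse
    _, nearest = tree.query(descriptors)  # nearest centroid per descriptor
    return np.bincount(nearest, minlength=centroids.shape[0])
```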
UPDATE:
I was talking with a colleague of mine: he suggested that using a hierarchical clustering approach instead of classic k-means should speed up the process a lot! Is it correct to say that if we have k centroids, with a hierarchical clustering we only have to do log(k) comparisons to find the closest centroid, instead of k comparisons?
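If I understand the idea correctly, a sketch would look like the following (a toy hierarchical k-means in the vocabulary-tree style of Nistér & Stewénius; with branching factor b and depth d you get roughly k = b^d leaves, and a lookup costs b comparisons per level, so about b * log_b(k) comparisons in total instead of k):

```python
import itertools
import numpy as np
from scipy.cluster.vq import kmeans2

class Node:
    def __init__(self):
        self.centers = None    # (b, 128) centers of this node's children
        self.children = None   # list of Node, or None if this is a leaf
        self.leaf_id = None    # visual-word index if this is a leaf

def build_tree(descriptors, b=10, depth=2, _ids=None):
    """Recursively run k-means with branching factor b on the descriptors."""
    _ids = _ids if _ids is not None else itertools.count()
    node = Node()
    if depth == 0 or len(descriptors) < b:
        node.leaf_id = next(_ids)      # this node becomes a visual word
        return node
    centers, labels = kmeans2(descriptors.astype(float), b, minit="points")
    node.centers = centers
    node.children = [build_tree(descriptors[labels == i], b, depth - 1, _ids)
                     for i in range(b)]
    return node

def quantize(node, d):
    """Descend the tree: b distance comparisons per level, ~log_b(k) levels."""
    while node.children is not None:
        i = np.argmin(np.linalg.norm(node.centers - d, axis=1))
        node = node.children[i]
    return node.leaf_id
```

Counting the visual words returned by quantize over an image's descriptors would then give the histogram, as before.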