Content Based Image Retrieval (CBIR): Bag

I've read a lot of papers about the Nearest Neighbor problem, and it seems that indexing techniques like randomized kd-trees or LSH, which can operate in high-dimensional spaces, have been used successfully for Content Based Image Retrieval (CBIR). One really common experiment is: given a SIFT query vector, find the most similar SIFT descriptor in the dataset. If we repeat the process for all the SIFT descriptors detected in the query image, we can find the most similar image.
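To make the first approach concrete, here is a minimal sketch of descriptor matching with voting. It uses brute-force search in place of a kd-tree or LSH index, and tiny 3-D toy vectors in place of real 128-D SIFT descriptors; the function names and the data are made up for illustration only.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_image(query_descriptor, database):
    """Return the image id owning the database descriptor closest to the query.

    database is a list of (image_id, descriptor) pairs. This is exhaustive
    search; a kd-tree or LSH index would replace this step at scale.
    """
    best_id, _ = min(
        database,
        key=lambda entry: squared_distance(query_descriptor, entry[1]),
    )
    return best_id

def retrieve_by_voting(query_descriptors, database):
    """Each query descriptor votes for the image of its nearest neighbor."""
    votes = Counter(nearest_image(d, database) for d in query_descriptors)
    return votes.most_common()

# Toy data (hypothetical): two images with two descriptors each.
database = [
    ("img_a", [0.0, 0.0, 1.0]),
    ("img_a", [0.1, 0.9, 0.0]),
    ("img_b", [5.0, 5.0, 5.0]),
    ("img_b", [4.0, 6.0, 5.0]),
]
query = [[0.0, 0.1, 0.9], [0.2, 1.0, 0.0], [4.5, 5.5, 5.0]]
ranking = retrieve_by_voting(query, database)
print(ranking)
```

The image with the most votes comes first in the ranking. The cost of this sketch grows with the total number of descriptors in the dataset, which is why approximate indexes are used in practice.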

However, another popular approach is to use Bag of Visual Words and convert all the detected SIFT descriptors into a huge sparse vector, which can be indexed with the same techniques used for text (e.g. an inverted index).
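A rough sketch of the second approach, under a few assumptions: the codebook of visual words is written by hand here (in practice it would usually be learned, for example with k-means over many descriptors), the descriptors are 2-D toy vectors rather than SIFT, and the scoring is a simple histogram intersection that I picked for brevity (text-style weighting such as tf-idf is a common alternative). All names are hypothetical.

```python
from collections import Counter, defaultdict

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(descriptor, codebook):
    """Map a descriptor to the index of its closest visual word."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(descriptor, codebook[i]),
    )

def bag_of_words(descriptors, codebook):
    """Sparse histogram: visual word index -> number of occurrences."""
    return Counter(quantize(d, codebook) for d in descriptors)

def build_inverted_index(images, codebook):
    """Inverted index: visual word -> {image_id: count of that word}."""
    index = defaultdict(dict)
    for image_id, descriptors in images.items():
        for word, count in bag_of_words(descriptors, codebook).items():
            index[word][image_id] = count
    return index

def search(query_descriptors, index, codebook):
    """Score only images that share a visual word with the query."""
    scores = Counter()
    for word, q_count in bag_of_words(query_descriptors, codebook).items():
        for image_id, count in index.get(word, {}).items():
            scores[image_id] += min(q_count, count)  # histogram intersection
    return scores.most_common()

# Toy data (hypothetical): three visual words, two images.
codebook = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
images = {
    "img_a": [[0.5, 0.2], [9.0, 1.0], [9.5, 0.0]],
    "img_b": [[0.0, 9.0], [1.0, 10.0], [0.3, 0.1]],
}
index = build_inverted_index(images, codebook)
results = search([[9.8, 0.5], [10.2, 0.1], [0.1, 0.1]], index, codebook)
print(results)
```

The point of the inverted index is that a query only touches the images listed under its own visual words, instead of comparing against every descriptor in the dataset.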

My question is: these two approaches (matching the SIFT descriptors through a Nearest Neighbor technique vs. Bag of Features on SIFT descriptors + inverted index) are extremely different, and I don't understand which one is better.

If the second approach is better, what is the application of Nearest Neighbor in Computer Vision / Image Processing?

Oh boy, you are asking a question that I think even the papers can't answer. To compare them, one should take the state-of-the-art methods of both approaches and measure their speed, accuracy and recall on the same task. The one with the better numbers is the better approach.
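As a rough idea of what such a measurement could look like, here is a small sketch that runs any search function over the same queries and reports mean recall at k plus average time per query. The harness, the toy search function and the ground truth are all made up for illustration; a real comparison would plug in the two retrieval systems and a labeled benchmark.

```python
import time

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant images that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def evaluate(search_fn, queries, ground_truth, k):
    """Return (mean recall at k, average seconds per query) for search_fn.

    search_fn takes a query and returns a ranked list of image ids.
    queries maps a query name to the query itself; ground_truth maps
    the same names to the list of relevant image ids.
    """
    start = time.perf_counter()
    recalls = [
        recall_at_k(search_fn(query), ground_truth[name], k)
        for name, query in queries.items()
    ]
    elapsed = time.perf_counter() - start
    return sum(recalls) / len(recalls), elapsed / len(queries)

# Hypothetical stand-in for a real retrieval system: fixed rankings.
fixed_rankings = {"x": ["a", "b", "c"], "y": ["c", "d", "a"]}

def toy_search(query):
    return fixed_rankings[query]

queries = {"q1": "x", "q2": "y"}
ground_truth = {"q1": ["a", "c"], "q2": ["b"]}
mean_recall, seconds_per_query = evaluate(toy_search, queries, ground_truth, k=2)
print(mean_recall, seconds_per_query)
```

Running both approaches through the same harness, on the same machine and the same data, is what makes the numbers comparable.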

Personally, I haven't heard much about Bag of Visual Words; I have used the bag-of-words model only in text-related projects, not image-related ones. Moreover, I am pretty sure I have seen many people use the first approach (including me, in our research).

That's the best I've got, so if I were you I would search for a paper that compares these two approaches. If I couldn't find one, I would find the best representative of each approach (the link you posted has a paper from 2009, which is old, I guess) and check their experiments.

But be careful! To compare the approaches through their best representatives, you need to make sure that the experiments in each paper are really comparable: the machines used should be of similar power, the data used should be of the same nature and size, and so on.