Comparison of the runtime of Nearest Neighbor quer

2019-01-12 11:16发布

问题:

Given n points in d-dimensional space, there are several data structures, such as Kd-Trees, Quadtrees, etc. to index the points. On these data structures it is possible to implement straight-forward algorithm for nearest neighbor queries around a given input point. Is there a book, paper, survey, ... that compares the theoretical (mostly expected) runtime of the nearest neighbor query on different data structures? The data I am looking at is composed of fairly small point clouds, so it can all be processed in main memory. For the sake of simplicity, I assume the data to be uniformly distributed. That is, im am not interested in real-world performance, but rather theoretical results

回答1:

You let the dimension of the points undefined and you just give an approximation for the number of points. What does small means? It's relative what one person means by small.

What you search, of course, doesn't exist. Your question is pretty much this:


Question:

For a small (whatever does small means to you) dataset, of any dimension with data that follow a uniform distribution, what's the optimal data structure to use?

Answer:

There is no such data structure.


Wouldn't it be too strange to have an answer on that? A false analogy would be to put as a synonym of this question, the "Which is the optimal programming language?" question that most of the first year undergrads have. Your question is not that naive, but it's walking on the same track.


Why there is no such data structure?

Because, the dimension of the dataset is variable. This means, that you might have a dataset in 2 dimensions, but it could also mean that you have a dataset in 1000 dimensions, or even better a dataset in 1000 dimensions, with an intrinsic dimension that is much less than 1000. Think about it, could one propose a data structure that would behave equally good for the three datasets I mentioned it? I doubt it.

In fact, there are some data structures that behave really nicely in low dimensions (quadtrees and KD-trees for example), while others are doing much better in higher dimensions (RKD-tree forest for instance).

Moreover, the algorithms and the expectations used for Nearest Neighbour search are heavily dependent on the dimension of the dataset (as well as the size of the dataset and the nature of the queries (for example a query that is too far from the dataset or equidistant from the points of the dataset will probably result in a slow search performance)).

In lower dimensions, one would perform a k-Nearest Neighbour(k-NN) search. In higher dimensions, it would be more wise to perform k-Approximate NN search. In this case, we follow the following trade-off:

Speed VS accuracy

What happens is that we achieve faster execution of the program, by sacrificing the correctness of our result. In other words, our search routine will be relatively fast, but it may (the possibility of this depends on many parameters, such as the nature of your problem and the library you are using) not return the true NN, but an approximation of the exact NN. For example it might not find the exact NN, but the third NN to the query point. You could also check the approximate-nn-searching wiki tag.

Why not always searching for the exact NN? Because of the curse of dimensionality, which results in the solutions provided in the lower dimensions to behave as good as the brute force would do (search all the points in the dataset for every query).


You see my answer already got big, so I should stop here. Your question is too broad, but interesting, I must admit. :)


In conclusion, which would be the optimal data structure (and algorithm) to use depends on your problem. The size of the dataset you are handling, the dimension and the intrinsic dimension of the points play a key role. The number and the nature of the queries also play an important role.



回答2:

For nearest neighbor searches of potentially non-uniform point data I think a kd-tree will give you the best performance in general. As far as broad overviews and theoretical cost analysis I think Wikipedia is an OK place to start, but keep in mind it does not cover much real-world optimization:

http://en.wikipedia.org/wiki/Nearest_neighbor_search

http://en.wikipedia.org/wiki/Space_partitioning

Theoretical performance is one thing but real world performance is something else entirely. Real world performance depends as much on the details of the data structure implementation as it does on the type of data structure. For example, a pointer-less (compact array) implementation can be many times faster than a pointer-based implementation because of improved cache coherence and faster data allocation. Wider branching may be slower in theory but faster in practice if you leverage SIMD to test several branches simultaneously.

Also the exact nature of your point data can have a big impact on performance. Uniform distributions are less demanding and can be handled quickly with simpler data structures. Non-uniform distributions require more care. (Kd-trees work well for both uniform and non-uniform data.) Also, if your data is too large to process in-core then you will need to take an entirely different approach compared to smaller data sets.