Best data structure for high dimensional nearest neighbor search

Posted 2020-07-24 05:09

Question:

I'm currently working on high dimensional data (~50,000-100,000 features) on which nearest neighbor search must be performed. I know that KD-trees perform poorly as the number of dimensions grows, and I've also read that, in general, all space-partitioning data structures tend to degenerate into exhaustive search on high dimensional data.

Additionally, there are two important requirements to consider (ordered by relevance):

  • Precision: The nearest neighbors must be found (not approximations).
  • Speed: The search must be as fast as possible. (The time to create the data structure isn't really important).

So, I need some advice about:

  1. Which data structure should I use to perform k-NN?
  2. Would it be better to use an ANN (approximate nearest neighbor) approach, tuned to be as accurate as possible?

Answer 1:

Can I perform exact NN search in high dimensional space?

No. Because of the curse of dimensionality, data structures that perform nearest neighbour search well in low dimensions fail to perform well in high dimensional spaces. In practice, their query times become almost equal to that of a brute-force scan, which makes them worthless.
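To make that baseline concrete, here is a minimal brute-force exact k-NN sketch in C++; the flat vector-of-vectors layout, the function name, and the use of squared distances for ranking are illustrative assumptions on my part, not from any particular library:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Brute-force exact k-NN: scan every point, keep the k smallest distances.
// Squared Euclidean distance suffices for ranking (monotone in true distance).
std::vector<std::pair<double, std::size_t>>
knn_brute_force(const std::vector<std::vector<double>>& data,
                const std::vector<double>& query, std::size_t k) {
    std::vector<std::pair<double, std::size_t>> dist;
    dist.reserve(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        double d2 = 0.0;
        for (std::size_t j = 0; j < query.size(); ++j) {
            const double diff = data[i][j] - query[j];
            d2 += diff * diff;
        }
        dist.emplace_back(d2, i);
    }
    k = std::min(k, dist.size());
    // Partial sort: only the k best results need to be ordered.
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    dist.resize(k);
    return dist;  // (squared distance, index) pairs, nearest first
}
```

This O(n·d) scan per query is essentially the performance floor that exact space-partitioning structures collapse to in high dimensions.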

As a result, in a high dimensional space, one should go for Approximate Nearest Neighbour (ANN) search. To be honest, it's a must.

Which data structure to perform ANN?

I would suggest LSH, or a forest of randomized k-d (RKD) trees. In my answer here, I mention some good libraries that perform ANN in C++. However, notice that LSH solves the R-near neighbour problem, so you specify the parameter R, which is actually a radius. LSH will then look for NNs within distance R of the query point, so you can't directly request k NNs; a minimal sketch of the idea follows.
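Below is a minimal single-table sketch of Euclidean (p-stable) LSH, using the standard hash h(x) = floor((a·x + b) / w) with Gaussian a and uniform offset b. All names and parameters are illustrative assumptions; a real implementation, like the libraries mentioned above, would use several tables with independently drawn hashes to boost recall:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// One p-stable LSH table for Euclidean distance: h(x) = floor((a.x + b) / w).
// Nearby points collide with higher probability than distant ones.
struct LshTable {
    std::vector<std::vector<double>> a;  // one Gaussian direction per hash
    std::vector<double> b;               // uniform offsets in [0, w)
    double w;                            // bucket width, tuned relative to R
    std::unordered_map<std::string, std::vector<std::size_t>> buckets;

    LshTable(std::size_t dim, std::size_t num_hashes, double width,
             std::mt19937& rng)
        : a(num_hashes, std::vector<double>(dim)), b(num_hashes), w(width) {
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::uniform_real_distribution<double> uni(0.0, width);
        for (auto& dir : a)
            for (auto& x : dir) x = gauss(rng);
        for (auto& off : b) off = uni(rng);
    }

    // Concatenate the integer hash values into a single bucket key.
    std::string key(const std::vector<double>& p) const {
        std::string k;
        for (std::size_t i = 0; i < a.size(); ++i) {
            double dot = 0.0;
            for (std::size_t j = 0; j < p.size(); ++j) dot += a[i][j] * p[j];
            k += std::to_string(
                     static_cast<long>(std::floor((dot + b[i]) / w))) + ",";
        }
        return k;
    }

    void insert(const std::vector<double>& p, std::size_t id) {
        buckets[key(p)].push_back(id);
    }

    // R-near neighbours: candidates from the query's bucket, verified exactly.
    std::vector<std::size_t> query(const std::vector<std::vector<double>>& data,
                                   const std::vector<double>& q, double R) const {
        std::vector<std::size_t> result;
        auto it = buckets.find(key(q));
        if (it == buckets.end()) return result;
        for (std::size_t id : it->second) {
            double d2 = 0.0;
            for (std::size_t j = 0; j < q.size(); ++j) {
                const double diff = data[id][j] - q[j];
                d2 += diff * diff;
            }
            if (d2 <= R * R) result.push_back(id);
        }
        return result;
    }
};
```

Note how R enters only at verification time, while w controls how aggressively the hash separates points; that is why you tune a radius rather than asking for k neighbours directly.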

On the other hand, RKD trees can do that and return the k NNs. I have a project that builds a forest of RKD trees and performs ANN search in C++, but it targets only high dimensions. It can handle GIST datasets of 10^6 images in 960 dimensions in < 1 sec, with about 90% of the outputs being true nearest neighbours. The name is kd-GeRaF. It will be updated next month with a distributed version, but it's already tested and ready to use. It also has a cute logo. :)


I also feel that you should read my answer, which says that the optimal data structure depends on the data.



Answer 2:

I don't think it would be wise to conduct clustering on such high dimensional data. There is the curse of dimensionality problem:

The concept of distance becomes less precise as the number of dimensions grows, since the distances between any two points in a given dataset converge to nearly the same value.
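You can observe this convergence directly by sampling random points and comparing the nearest and farthest distances from a query as the dimension grows; the ratio drifts toward 1. A small self-contained demo (uniform random data and the sample sizes are assumptions chosen purely for illustration):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Shows how the nearest/farthest distance ratio from a random query
// approaches 1 as dimensionality grows (distance concentration).
int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const int n = 1000;  // points sampled per dimensionality
    for (int dim : {2, 10, 100, 1000, 10000}) {
        std::vector<double> q(dim);
        for (auto& x : q) x = uni(rng);
        double dmin = 1e300, dmax = 0.0;
        for (int i = 0; i < n; ++i) {
            double d2 = 0.0;
            for (int j = 0; j < dim; ++j) {
                const double diff = uni(rng) - q[j];
                d2 += diff * diff;
            }
            dmin = std::min(dmin, d2);
            dmax = std::max(dmax, d2);
        }
        std::printf("dim=%5d  nearest/farthest = %.3f\n",
                    dim, std::sqrt(dmin / dmax));
    }
    return 0;
}
```

At dim=2 the nearest point is far closer than the farthest; by dim=10000 the two distances are nearly indistinguishable, which is exactly why "nearest" loses meaning.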

I suggest you find a good distance measure, rather than using plain Euclidean distance directly in the high dimensional space.
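As one common example of such a measure (my illustration, not a specific recommendation from the page below), cosine distance compares directions rather than magnitudes and is a frequent choice for high dimensional sparse data such as text:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine distance: 1 - cos(angle between u and v).
// Ignores vector magnitudes; assumes both vectors are non-zero.
double cosine_distance(const std::vector<double>& u,
                       const std::vector<double>& v) {
    double dot = 0.0, nu = 0.0, nv = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i) {
        dot += u[i] * v[i];
        nu  += u[i] * u[i];
        nv  += v[i] * v[i];
    }
    return 1.0 - dot / (std::sqrt(nu) * std::sqrt(nv));
}
```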

Some possible approaches are listed on this page: https://en.wikipedia.org/wiki/Clustering_high-dimensional_data

  • Subspace clustering
  • Projected clustering
  • Hybrid approaches
  • Correlation clustering