How to efficiently find k-nearest neighbours in hi

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using euclidean distance, currently k=2 if this makes it easiser)

My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimension grows. In my sample implementation, its only slightly faster than exhaustive search.

My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?

标签： algorithm data-structures computational-geometry nearest-neighbor dimensionality-reduction

6条回答

闹够了就滚

2楼-- · 2019-01-17 12:25

use a kd-tree

Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.

reduce the number of dimensions

Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.

By accuracy I mean finding the exact Nearest Neighbor (NN).

Principal Component Analysis(PCA) is a good idea when you want to reduce the dimensional space your data live on.

Is there some clever algorithm or data structure to solve this exactly in reasonable time?

Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (that is the 4th for example NN to your query, while you are looking for the 1st NN).

That approach cost you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.

You could read more about ANNS in the introduction our kd-GeRaF paper.

A good idea is to combine ANNS with dimensionality reduction.

Locality Sensitive Hashing (LSH) is a modern approach to solve the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it will be hashed to a bucket, where that bucket (and usually its neighboring ones) contain good NN candidates).

FALCONN is a good C++ implementation, which focuses in cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.

0人赞添加讨论(0) 举报

趁早两清

3楼-- · 2019-01-17 12:28

No reason to believe this is NP-complete. You're not really optimizing anything and I'd have a hard time figure out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation, n log (n) for the sort, but you have to do the sort n times (different for EVERY value of n). Messy, but still polynomial time to get your answers.

0人赞添加讨论(0) 举报

Anthone

4楼-- · 2019-01-17 12:36

BK-Tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings it should give you a spring board for other approaches. The other thing I can think of are R-Trees, however I don't know if they've been generalized for large dimensions. I can't say more than that since I neither have used them directly nor implemented them myself.

0人赞添加讨论(0) 举报

爷、活的狠高调

5楼-- · 2019-01-17 12:39

The Wikipedia article for kd-trees has a link to the ANN library:

ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.

Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)

As far as algorithm/data structures are concerned:

The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.

I'd try it first directly and if that doesn't produce satisfactory results I'd use it with the data set after applying PCA/ICA (since it's quite unlikely your going to end up with few enough dimensions for a kd-tree to handle).

0人赞添加讨论(0) 举报

该账号已被封号

6楼-- · 2019-01-17 12:41

You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.

0人赞添加讨论(0) 举报

贪生不怕死

7楼-- · 2019-01-17 12:45

One very common implementation would be to sort the Nearest Neighbours array that you have computed for each data point. As sorting the entire array can be very expensive, you can use methods like indirect sorting, example Numpy.argpartition in Python Numpy library to sort only the closest K values you are interested in. No need to sort the entire array.

@Grembo's answer above should be reduced significantly. as you only need K nearest Values. and there is no need to sort the entire distances from each point.

If you just need K neighbours this method will work very well reducing your computational cost, and time complexity.

if you need sorted K neighbours, sort the output again

see

Documentation for argpartition

0人赞添加讨论(0) 举报

How to efficiently find k-nearest neighbours in hi

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间