Incremental Nearest Neighbor Algorithm in Python

Published 2020-05-26 17:03

Question:

Is anyone aware of a nearest neighbor algorithm implemented in Python that can be updated incrementally? All the ones I've found, such as this one, appear to be batch processes. Is it possible to implement an incremental NN algorithm?

Answer 1:

I think the problem with incremental construction of a KD-tree or KNN-tree is, as you've alluded to in a comment, that the tree will eventually become unbalanced, and you can't do simple tree rotations to fix the balance problems while keeping consistency. At a minimum, the re-balancing task is non-trivial, and one would definitely not want to do it at every insertion. Often, one will choose to build the tree with a batch method, insert new points while allowing the tree to become unbalanced up to a point, and then re-balance it.

A closely related approach is to build the data structure in batch for M points, use it while accumulating M' new points, and then re-build it in batch with all M + M' points. Since re-balancing a kd-tree is not one of the standard, fast rotation-based algorithms we know for binary trees, rebuilding is not necessarily slow in comparison and in some cases can even be faster (depending on the order in which points arrive at your incremental algorithm).

That being said, the amount of code you have to write, the debugging difficulty, and how easily others can understand your code can all be significantly smaller if you take the rebuild approach. If you do so, use a batch method to build the tree and keep an external list of points not yet inserted; a brute-force scan of that list ensures that none of its points is closer than the best match found in the tree.
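As a rough illustration of that rebuild strategy, here is a hypothetical sketch built on `scipy.spatial.cKDTree` (which the SciPy library does provide): new points go into a pending buffer that is checked by brute force at query time, and the tree is rebuilt in batch once the buffer grows past a threshold. The class name and threshold are my own inventions for this example.

```python
import numpy as np
from scipy.spatial import cKDTree

class RebuildingNN:
    """Sketch: batch-built KD-tree plus a brute-force buffer of
    pending points, rebuilt in batch once the buffer grows large."""

    def __init__(self, points, rebuild_at=100):
        self.tree_points = np.asarray(points, dtype=float)
        self.tree = cKDTree(self.tree_points)
        self.pending = []          # points not yet in the tree
        self.rebuild_at = rebuild_at

    def insert(self, point):
        self.pending.append(np.asarray(point, dtype=float))
        if len(self.pending) >= self.rebuild_at:
            # Re-build in batch with the old points plus the buffer.
            self.tree_points = np.vstack([self.tree_points] + self.pending)
            self.tree = cKDTree(self.tree_points)
            self.pending = []

    def nearest(self, query):
        query = np.asarray(query, dtype=float)
        best_dist, best_idx = self.tree.query(query)
        best = self.tree_points[best_idx]
        # Brute-force check of the points not yet in the tree.
        for p in self.pending:
            d = np.linalg.norm(p - query)
            if d < best_dist:
                best_dist, best = d, p
        return best, best_dist
```

A query is therefore one tree lookup plus a linear scan of at most `rebuild_at` buffered points, which stays cheap as long as the threshold is modest.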

Some links to Python implementations/discussions are below, but I haven't found any that explicitly claim to be incremental. Good luck.

http://www.scipy.org/Cookbook/KDTree

http://cgi.di.uoa.gr/~compgeom/pycgalvisual/kdppython.shtml

http://sites.google.com/site/mikescoderama/Home/kd-tree-knn

http://en.wikipedia.org/wiki/Kd-tree

Note: My comments here apply to high-dimensional spaces. If you're working in 2D or 3D, what I've said may not be appropriate. (If you're working in very high-dimensional spaces, use brute force or approximate nearest neighbor search.)



Answer 2:

This is way late, but for posterity:

There is actually a technique for converting batch-processed algorithms like KD-Tree into incremental algorithms: it's called a static-to-dynamic transformation.

To generate an incremental variant of a KD-Tree, you store a set of trees instead of just one tree. When there are N elements in your nearest-neighbor structure, your structure will have a tree for each "1" bit in the binary representation of N. Moreover, if tree T_i corresponds to the i-th bit of N, then tree T_i contains 2^i elements.

So, if you have 11 elements in your structure, then N = 11, or 1011 in binary, and therefore you have three trees - T_3, T_1, and T_0 - with 8 elements, 2 elements, and 1 element, respectively.

Now, let's insert an element e into our structure. After insertion, we'll have 12 elements, or 1100 in binary. Comparing the new and the previous binary string, we see that T_3 doesn't change, we have a new tree T_2 with 4 elements, and trees T_1 and T_0 get deleted. We construct the new tree T_2 by doing a batch insertion of e along with all the elements in the trees "below" T_2, which are T_1 and T_0.

In this way, we create an incremental point query structure from a static base structure. There is, however, an asymptotic slowdown in "incrementalizing" static structures like this in the form of an extra log(N) factor:

  • inserting N elements into the structure: O(N log(N) log(N))
  • nearest neighbor query on a structure with N elements: O(log(N) log(N))
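The insertion step above works exactly like incrementing a binary counter: a full slot "carries" its contents upward until an empty slot is found. Here is a hypothetical sketch of the technique, using `scipy.spatial.cKDTree` as the static base structure; the class name is my own, and for brevity each slot's tree is rebuilt at query time rather than cached, which a real implementation would not do.

```python
import numpy as np
from scipy.spatial import cKDTree

class LogarithmicNN:
    """Sketch of the static-to-dynamic ("logarithmic method")
    transformation: a list of slots where slot i, if occupied,
    holds exactly 2**i points, mirroring the bits of N."""

    def __init__(self):
        self.slots = []  # slots[i] is None or an array of 2**i points

    def insert(self, point):
        carry = [np.asarray(point, dtype=float).reshape(1, -1)]
        i = 0
        # Binary-counter merge: a full slot carries upward,
        # so the carry at level i always totals 2**i points.
        while True:
            if i == len(self.slots):
                self.slots.append(None)
            if self.slots[i] is None:
                self.slots[i] = np.vstack(carry)
                break
            carry.append(self.slots[i])
            self.slots[i] = None
            i += 1

    def nearest(self, query):
        query = np.asarray(query, dtype=float)
        best_dist, best = np.inf, None
        for pts in self.slots:
            if pts is None:
                continue
            # A real implementation would cache one cKDTree per slot;
            # rebuilding per query here just keeps the sketch short.
            d, idx = cKDTree(pts).query(query)
            if d < best_dist:
                best_dist, best = d, pts[idx]
        return best, best_dist
```

A query visits every occupied slot, of which there are at most log2(N) + 1, which is where the extra log(N) factor in the query bound comes from.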


Answer 3:

There is. The SciPy Cookbook website includes a complete implementation of a kNN algorithm that can be updated incrementally.

Maybe a few lines of background would be helpful for anyone interested but not familiar with the terminology.

A kNN engine is powered by one of two data representations: the pairwise distances between all points in the dataset stored in a two-dimensional array (a distance matrix), or a kd-tree, which just stores the data points themselves in a multi-dimensional binary tree.

There are only two operations that a kd-tree-based kNN algorithm needs: you create the tree from the dataset (analogous to the training step performed in batch mode in other ML algorithms), and you search the tree to find 'nearest neighbors' (analogous to the testing step).

Online or incremental training in the context of a kNN algorithm (provided it's based on a kd-tree) means inserting nodes into an already-built kd-tree.
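To make the insertion operation concrete, here is a minimal, hypothetical kd-tree sketch of my own (not the Cookbook code): each insert descends by comparing along the node's splitting axis, cycling the axis with depth, and hangs the new point off the first empty child. As answer 1 notes, repeated inserts like this can unbalance the tree over time.

```python
class KDNode:
    """Minimal kd-tree with incremental node insertion (a sketch;
    the SciPy Cookbook implementation is more complete)."""

    def __init__(self, point, axis=0):
        self.point = point
        self.axis = axis
        self.left = None
        self.right = None

    def insert(self, point):
        # Descend along the splitting axis, cycling axes with depth.
        side = 'left' if point[self.axis] < self.point[self.axis] else 'right'
        child = getattr(self, side)
        if child is None:
            setattr(self, side, KDNode(point, (self.axis + 1) % len(point)))
        else:
            child.insert(point)

    def nearest(self, query, best=None, best_dist=float('inf')):
        # Depth-first search, pruning the far subtree whenever the
        # splitting plane is farther away than the best match so far.
        d = sum((a - b) ** 2 for a, b in zip(self.point, query)) ** 0.5
        if d < best_dist:
            best, best_dist = self.point, d
        near, far = ((self.left, self.right)
                     if query[self.axis] < self.point[self.axis]
                     else (self.right, self.left))
        if near is not None:
            best, best_dist = near.nearest(query, best, best_dist)
        if far is not None and abs(query[self.axis] - self.point[self.axis]) < best_dist:
            best, best_dist = far.nearest(query, best, best_dist)
        return best, best_dist
```

Insertion costs O(log N) on a balanced tree but degrades toward O(N) as the tree skews, which is exactly why the rebuild and static-to-dynamic strategies in the other answers exist.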

Back to the kd-tree implementation in the SciPy Cookbook: the specific lines of code responsible for node insertion appear after the comment line "insert node in kd-tree" (in fact, all of the code after that comment is devoted to node insertion).

Finally, there is a kd-tree implementation in the spatial module of the SciPy library (scipy.spatial) called KDTree (scipy.spatial.KDTree), but I don't believe it supports node insertion; at least, no such function appears in the docs (I haven't looked at the source).