To a list of N
points [(x_1,y_1), (x_2,y_2), ... ]
I am trying to find the nearest neighbours to each point based on distance. My dataset is too large to use a brute force approach so a KDtree seems best.
Rather than implement one from scratch I see that sklearn.neighbors.KDTree
can find the nearest neighbours. Can this be used to find the nearest neighbours of each particle, i.e return a dim(N)
list?
This question is very broad and missing details. It's unclear what you did try, how your data looks like and what a nearest-neighbor is (identity?).
Assuming you are not interested in the identity (with distance 0), you can query the two nearest-neighbors and drop the first column. This is probably the easiest approach here.
Code:
import numpy as np
from sklearn.neighbors import KDTree
np.random.seed(0)
X = np.random.random((5, 2)) # 5 points in 2 dimensions
tree = KDTree(X)
nearest_dist, nearest_ind = tree.query(X, k=2) # k=2 nearest neighbors where k1 = identity
print(X)
print(nearest_dist[:, 1]) # drop id; assumes sorted -> see args!
print(nearest_ind[:, 1]) # drop id
Output
[[ 0.5488135 0.71518937]
[ 0.60276338 0.54488318]
[ 0.4236548 0.64589411]
[ 0.43758721 0.891773 ]
[ 0.96366276 0.38344152]]
[ 0.14306129 0.1786471 0.14306129 0.20869372 0.39536284]
[2 0 0 0 1]
You can use sklearn.neighbors.KDTree
's query_radius()
method, which returns a list of the indices of the nearest neighbours within some radius (as opposed to returning k nearest neighbours).
from sklearn.neighbors import KDTree
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
tree = KDTree(points, leaf_size=2)
all_nn_indices = tree.query_radius(points, r=1.5) # NNs within distance of 1.5 of point
all_nns = [[points[idx] for idx in nn_indices] for nn_indices in all_nn_indices]
for nns in all_nns:
print(nns)
Outputs:
[(1, 1), (2, 2)]
[(1, 1), (2, 2), (3, 3)]
[(2, 2), (3, 3), (4, 4)]
[(3, 3), (4, 4), (5, 5)]
[(4, 4), (5, 5)]
Note that each point includes itself in its list of nearest neighbours within the given radius. If you want to remove these identity points, the line computing all_nns
can be changed to:
all_nns = [
[points[idx] for idx in nn_indices if idx != i]
for i, nn_indices in enumerate(all_nn_indices)
]
Resulting in:
[(2, 2)]
[(1, 1), (3, 3)]
[(2, 2), (4, 4)]
[(3, 3), (5, 5)]
[(4, 4)]
The sklearn should be the best. I wrote the below some time back ,where I needed custom distance. (I guess sklearn does not support custom distance fn 'KD tree' with custom distance metric . Adding for reference
Adapted from my gist for 2D https://gist.github.com/alexcpn/1f187f2114976e748f4d3ad38dea17e8
# From https://gist.github.com/alexcpn/1f187f2114976e748f4d3ad38dea17e8
# Author alex punnen
from collections import namedtuple
from operator import itemgetter
import numpy as np
def find_nearest_neighbour(node,point,distance_fn,current_axis):
# Algorith to find nearest neighbour in a KD Tree;the KD tree has done a spatial sort
# of the given co-ordinates, such that to the left of the root lies co-ordinates nearest to the x-axis
# and to the right of the root ,lies the co-ordinates farthest from the x axis
# On the y axis split on the left of the parent/root node lies co-ordinates nearest to the y-axis and to
# the right of the root, lies the co-ordinates farthest from the y axis
# to find the nearest neightbour, from the root, you first check left and right node; if distance is closer
# to the right node,then the entire left node can be discarded from search, because of the spatial split
# and that node becomes the root node. This process is continued recursively till the nearest is found
# param:node: The current node
# param: point: The point to which the nearest neighbour is to be found
# param: distance_fn: to calculate the nearest neighbour
# param: current_axis: here assuming only two dimenstion and current axis will be either x or y , 0 or 1
if node is None:
return None,None
current_closest_node = node
closest_known_distance = distance_fn(node.cell[0],node.cell[1],point[0],point[1])
print closest_known_distance,node.cell
x = (node.cell[0],node.cell[1])
y = point
new_node = None
new_closest_distance = None
if x[current_axis] > y[current_axis]:
new_node,new_closest_distance= find_nearest_neighbour(node.left_branch,point,distance_fn,
(current_axis+1) %2)
else:
new_node,new_closest_distance = find_nearest_neighbour(node.right_branch,point,distance_fn,
(current_axis+1) %2)
if new_closest_distance and new_closest_distance < closest_known_distance:
print 'Reset closest node to ',new_node.cell
closest_known_distance = new_closest_distance
current_closest_node = new_node
return current_closest_node,closest_known_distance
class Node(namedtuple('Node','cell, left_branch, right_branch')):
# This Class is taken from wikipedia code snippet for KD tree
pass
def create_kdtree(cell_list,current_axis,no_of_axis):
# Creates a KD Tree recursively following the snippet from wikipedia for KD tree
# but making it generic for any number of axis and changes in data strucure
if not cell_list:
return
# get the cell as a tuple list this is for 2 dimensions
k= [(cell[0],cell[1]) for cell in cell_list]
# say for three dimension
# k= [(cell[0],cell[1],cell[2]) for cell in cell_list]
k.sort(key=itemgetter(current_axis)) # sort on the current axis
median = len(k) // 2 # get the median of the list
axis = (current_axis + 1) % no_of_axis # cycle the axis
return Node(k[median], # recurse
create_kdtree(k[:median],axis,no_of_axis),
create_kdtree(k[median+1:],axis,no_of_axis))
def eucleaden_dist(x1,y1,x2,y2):
a= np.array([x1,y1])
b= np.array([x2,y2])
dist = np.linalg.norm(a-b)
return dist
np.random.seed(0)
#cell_list = np.random.random((2, 2))
#cell_list = cell_list.tolist()
cell_list = [[2,2],[4,8],[10,2]]
print(cell_list)
tree = create_kdtree(cell_list,0,2)
node,distance = find_nearest_neighbour(tree,(1, 1),eucleaden_dist,0)
print 'Nearest Neighbour=',node.cell,distance
node,distance = find_nearest_neighbour(tree,(8, 1),eucleaden_dist,0)
print 'Nearest Neighbour=',node.cell,distance