Efficiently checking Euclidean distance for a larg

In a route planning algorithm, I'm trying to perform a filter on a list of nodes based on distance to another node. I'm actually pulling the lists from a crude scene graph. I use the term "cell" to refer to a volume within a simple scenegraph from which we've fetched a list of nodes that are close to each other.

Right now, I'm implementing this as:

# SSCCE version of the core function
def nodes_in_range(src, cell, maxDist):
    srcX, srcY, srcZ = src.x, src.y, src.z
    maxDistSq = maxDist ** 2
    for node in cell:
        distSq = (node.x - srcX) ** 2
        if distSq > maxDistSq: continue
        distSq += (node.y - srcY) ** 2
        if distSq > maxDistSq: continue
        distSq += (node.z - srcZ) ** 2
        if distSq <= maxDistSq:
            yield node, distSq ** 0.5  # fast sqrt

from collections import namedtuple
class Node(namedtuple('Node', ('ID', 'x', 'y', 'z'))):
    # actual class has assorted other properties
    pass

# 1, 3 and 9 are <= 4.2 from Node(1)
cell = [
    Node(1, 0, 0, 0),
    Node(2, -2, -3, 4),
    Node(3, .1, .2, .3),
    Node(4, 2.3, -3.3, -4.5),
    Node(5, -2.5, 4.5, 5),
    Node(6, 4, 3., 2.),
    Node(7, -2.46, 2.46, -2.47),
    Node(8, 2.45, -2.46, -2.47),
    Node(9, .5, .5, .1),
    Node(10, 5, 6, 7),
    # In practice, cells have upto 600 entries
]

if __name__ == "__main__":
    for node, dist in nodes_in_range(cell[0], cell, 4.2):
        print("{:3n} {:5.2f}".format(node.ID, dist))

This routine is being called a lot (10^7+ times in some queries), so each bit of perf matters, and avoiding the memberwise lookup with conditionals actually helped.

What I'm trying to do is switch to numpy and organize the cells so that I can vectorize. What I want to achieve is this:

import numpy
import numpy.linalg
contarry = numpy.ascontiguousarray
float32 = numpy.float32

# The "np_cell" has two arrays: one is the list of nodes and the
# second is a vectorizable array of their positions.
# np_cell[N][1] == numpy array position of np_cell[N][0]

def make_np_cell(cell):
    return (
        cell,
        contarry([contarry((node.x, node.y, node.z), float32) for node in cell]),
     )

# This version fails because norm returns a single value.
def np_nodes_in_range1(srcPos, np_cell, maxDist):
    distances = numpy.linalg.norm(np_cell[1] - srcPos)

    for (node, dist) in zip(np_cell[0], distances):
        if dist <= maxDist:
            yield node, dist

# This version fails because 
def np_nodes_in_range2(srcPos, np_cell, maxDist):
    # this will fail because the distances are wrong
    distances = numpy.linalg.norm(np_cell[1] - srcPos, ord=1, axis=1)
    for (node, dist) in zip(np_cell[0], distances):
        if dist <= maxDist:
            yield node, dist

# This version doesn't vectorize and so performs poorly
def np_nodes_in_range3(srcPos, np_cell, maxDist):
    norm = numpy.linalg.norm
    for (node, pos) in zip(np_cell[0], np_cell[1]):
        dist = norm(srcPos - pos)
        if dist <= maxDist:
            yield node, dist

if __name__ == "__main__":
    np_cell = make_np_cell(cell)
    srcPos = np_cell[1][0]  # Position column [1], first node [0]
    print("v1 - fails because it gets a single distance")
    try:
        for node, dist in np_nodes_in_range1(srcPos, np_cell, float32(4.2)):
            print("{:3n} {:5.2f}".format(node.ID, dist))
    except TypeError:
        print("distances was a single value")

    print("v2 - gets the wrong distance values")
    for node, dist in np_nodes_in_range2(srcPos, np_cell, float32(4.2)):
        print("{:3n} {:5.2f}".format(node.ID, dist))

    print("v3 - slower")
    for node, dist in np_nodes_in_range3(srcPos, np_cell, float32(4.2)):
        print("{:3n} {:5.2f}".format(node.ID, dist))

Combined whole is here - I included a v4 which tries using enumerate instead of zip and finds that it's about 12us slower.

Sample output:

  1  0.00
  3  0.37
  9  0.71
v1 - fails because it gets a single distance
distances was a single value
v2 - gets the wrong distance values
  1  0.00
  3  0.60
  9  1.10
v3 - slower
  1  0.00
  3  0.37
  9  0.71
v4 - v2 using enumerate
  1  0.00
  3  0.60
  9  1.10

As for performance, we can test this with timeit. I'll beef up the number of nodes in the cell with a simple multiplication:

In [2]: from sscce import *
In [3]: cell = cell * 32   # increase to 320 nodes
In [4]: len(cell)
Out[4]: 320
In [5]: %timeit -n 1000 -r 7 sum(1 for _ in nodes_in_range(cell[0], cell, 4.2))
1000 loops, best of 7: 742 µs per loop
In [6]: np_cell = make_np_cell(cell)
In [7]: srcPos = np_cell[1][0]
In [8]: %timeit -n 1000 -r 7 sum(1 for _ in np_nodes_in_range2(srcPos, np_cell, numpy.float32(4.2)))
1000 loops, best of 7: 136 µs per loop
In [9]: %timeit -n 1000 -r 7 sum(1 for _ in np_nodes_in_range3(srcPos, np_cell, numpy.float32(4.2)))
1000 loops, best of 7: 3.64 ms per loop

Highlights:

nodes_in_range
    1000 loops, best of 7: 742 µs per loop

np_nodes_in_range2
    1000 loops, best of 7: 136 µs per loop

np_nodes_in_range3
    1000 loops, best of 7: 3.64 ms per loop # OUCH

Questions:

What am I doing wrong with the vectorized distance calculation?

distances = numpy.linalg.norm(np_cell[1] - srcPos)

distances = numpy.linalg.norm(np_cell[1] - srcPos, ord=1, axis=1)

Is this the best approach?
Cell population varies, between a few nodes and hundreds. I currently iterate over cells, but it seems likely that I want to marshal a full set of candidates (nodes[], positions[]) despite the potential extra cost of building the list for this (I could always use a batch accumulator so that I always try and fill the accumulator with, say, at least 1024 positions before draining). But I think this thinking is shaped by my use of contiguous array. Should I be looking towards something like:
```
nodes_in_range(src, chain(cell.nodes for cell in scene if cell_in_range(boundingBox)))
```

and not worrying about trying to flatten the whole thing?

What am I doing wrong with the vectorized distance calculation?
distances = numpy.linalg.norm(np_cell[1] - srcPos)
vs
distances = numpy.linalg.norm(np_cell[1] - srcPos, ord=1, axis=1)

Firstly, if axis=None, np.linalg.norm will compute either the vector norm (if the input is 1D) or the matrix norm (if the input is multidimensional). Both of these are scalar quantities.

Secondly, ord=1 means the L1 norm (i.e. Manhattan distance), not Euclidean distance, as you mention in the title.

Is this the best approach?

A k-D tree would probably be much faster. You can use scipy.spatial.cKDTree to do a ball search to find nodes that are within some threshold distance from a query point:

import numpy as np
from scipy.spatial import cKDTree

# it will be much easier (and faster) to deal with numpy arrays here (you could
# always look up the corresponding node objects by index if you wanted to)
X = np.array([(n.x, n.y, n.z) for n in cell])

# construct a k-D tree
tree = cKDTree(X)

# query it with the first point, find the indices of all points within a maximum
# distance of 4.2 of the query point
query_point = X[0]
idx = tree.query_ball_point(query_point, r=4.2, p=2)

# these indices are one out from yours, since they start at 0 rather than 1
print(idx)
# [0, 2, 8]

# query_ball_point doesn't return the distances, but we can easily compute these
# using broadcasting
neighbor_points = X[idx]

d = np.sqrt(((query_point[None, :] - neighbor_points) ** 2).sum(1))
print(d)
# [ 0.          0.37416574  0.71414284]

Benchmarking:

Querying a cKDTree is blisteringly fast, even for very large numbers of points:

X = np.random.randn(10000000, 3)
tree = cKDTree(X)

%timeit tree.query_ball_point(np.random.randn(3), r=4.2)
# 1 loops, best of 3: 229 ms per loop

As you mentioned in the comments, the example above is a much harsher test of performance than your data. Because of the distance tolerance chosen, and the fact that the data is Gaussian (and therefore clustered around 0), it matches about 99% of the 10m points.

Here's a test on uniform data with a stricter distance cutoff that matches about 30% of the points, as in your example:

%timeit tree.query_ball_point((0., 0., 0.), r=1.2)
# 10 loops, best of 3: 86 ms per loop

Obviously, this is much larger number of points than you are using. For your example data:

tree = cKDTree(np_cell[1])
%timeit tree.query_ball_point(np_cell[1][0], r=4.2)
# The slowest run took 4.26 times longer than the fastest. This could mean that an intermediate result is being cached 
# 100000 loops, best of 3: 16.9 µs per loop

This comfortably beats your np_nodes_in_range2 function on my machine:

%timeit sum(1 for _ in np_nodes_in_range2(srcPos, np_cell, numpy.float32(4.2)))
# The slowest run took 7.77 times longer than the fastest. This could mean that an intermediate result is being cached 
# 10000 loops, best of 3: 84.4 µs per loop

Other things to consider:

If you need to query many points simultaneously, it is more efficient to construct a second tree and use query_ball_tree instead of query_ball_point:

X = np.random.randn(100, 3)
Y = np.random.randn(10, 3)
tree1 = cKDTree(X)
tree2 = cKDTree(Y)

# indices contains a list-of-lists, where the ith sublist contains the indices
# of the neighbours of Y[i] in X
indices = tree2.query_ball_tree(tree1, r=4.2)

If you don't care about the indices, and just want the number of points in the ball, it will probably be faster to use count_neighbours:

n_neighbors = tree2.count_neighbors(tree1, r=4.2)