I've got millions of geographic points. For each one of these, I want to find all "neighboring points," i.e., all other points within some radius, say a few hundred meters.
There is a naive O(N^2) solution to this problem---simply calculate the distance of all pairs of points. However, because I'm dealing with a proper distance metric (geographic distance), there should be a quicker way to do this.
I would like to do this within python. One solution that comes to mind is to use some database (mySQL with GIS extentions, PostGIS) and hope that such a database would take care of efficiently performing the operation described above using some index. I would prefer something simpler though, that doesn't require me to build and learn about such technologies.
A couple of points
- I will perform the "find neighbors" operation millions of times
- The data will remain static
- Because the problem is in a sense simple, I'd like to see they python code that solves it.
Put in terms of python code, I want something along the lines of:
points = [(lat1, long1), (lat2, long2) ... ] # this list contains millions lat/long tuples
points_index = magical_indexer(points)
neighbors = []
for point in points:
point_neighbors = points_index.get_points_within(point, 200) # get all points within 200 meters of point
neighbors.append(point_neighbors)
scipy
First things first: there are preexisting algorithms to do things kind of thing, such as the k-d tree. Scipy has a python implementation cKDtree that can find all points in a given range.
Binary Search
Depending on what you're doing however, implementing something like that may be nontrivial. Furthermore, creating a tree is fairly complex (potentially quite a bit of overhead), and you may be able to get away with a simple hack I've used before:
Effectively, you're doing O(N log(N)) preprocessing, and for each point roughly o(sqrt(N)) - or more, if the distribution of your points is poor. If the points are roughly uniformly distributed, the number of points nearer in X than the nearest neighbor will be on the order of the square root of N. It's less efficient if many points are within your range, but never much worse than brute force.
One advantage of this method is that's it all executable in very few memory allocations, and can mostly be done with very good memory locality, which means that it performs quite well despite the obvious limitations.
Delauney triangulation
Another idea: a Delauney triangulation could work. For the Delauney triangulation, it's given that any point's nearest neighbor is an adjacent node. The intuition is that during a search, you can maintain a heap (priority queue) based on absolute distance from query point. Pick the nearest point, check that it's in range, and if so add all its neighbors. I suspect that it's impossible to miss any points like this, but you'd need to look at it more carefully to be sure...
Tipped off by Eamon, I've come up with a simple solution using btrees implemented in SciPy.