I have a list L of ~30k locations (written as longitude/latitude pairs), and a list E of ~1m events (with locations written as longitude/latitude pairs), each of which occurs at a point in L. I want to tag each event in E with its corresponding location in L. But the cooordinates in L and E are rounded differently—E to five decimal places, L to thirteen—so ostensibly identical coordinates can actually differ by ~10^-5 degrees, or ~1 meter. (Points in L are separated by at least ~10 m.)
I thus need the nearest point in L to every point of E; the obvious O(|L||E|) brute-force algorithm is too slow. L is small enough compared to E that algorithms that preprocess L and amortize the preprocessing time over E are fine. Is this a well-studied problem? The links that I can find are for related but distinct problems, like finding the minimal distance between a pair of points in one set.
Possibly relevant: Voronoi diagrams, though I can't see how preprocessing L into a Voronoi diagram would save me computational time.
This is amenable to a fairly direct application of space-partitioning. Most GIS libraries should provide tools to do this.
My own experience is restricted to using R-Trees. I create an index of the L items. You can use the points directly or a bounding box representing the uncertainty around the point. Then the R-tree supports an efficient (log(L) time) query of the n nearest neighbouring points/bounding boxes.
The implementation I have used successfully is libspatialindex wrapped into a python library called Rtree.
However, note that this specific tool is accurate only when using Euclidean x,y coordinates. It has significant potential for error using lat longs away from the equator, and particularly if you want to cover a geographically large area. In my case I was restricted to a single country within which I used eastings/northings. I'm not aware of which libraries provide support for handling the problem using great circle distances but they certainly ought to be available.
From your description:
Solution: Round coordinates in L to the same precision of E and match equal pairs.
Explanation: Rounding effectively maps each point to its nearest neighbor on a square grid. As long as the grid resolution (~1 meter when rounding to five decimals) is lower than half the minimal distance between two points in L (~10/2 meters), you are good to go.
For maximum performance and exploiting |L| << |E|, you might want to build a hash map over the rounded coordinates in L and match each point in E in near constant time. Total runtime would then be limited by O(|E|).
My library GeographicLib contains a class NearestNeighbor which implements the vantage-point trees. This works for any data where there's a true distance metric; the geodesic distance obviously satisfies this criterion. Basically you initialize the class with your
L
locations and then for each ofE
events you ask for the nearest location. The breakdown on the computational costs is as follows(These are base-2 logs.) Here's code that will do the lookup. Running this on 30k locations and 1M events takes about 40 secs and involved a total of 16M geodesic distance calculations. (The brute force way would take about 21 hours.)
Yes you are correct. First you can construct the Voronoi Diagram of your point set of locations L in O(|L| log |L|) time using Furtune's Sweep Line approach. There are various implementations out there that can be used, Triangle would be among the most common ones.
Now you have a partitioning of the plane of O(|L|) size. To allow O(log |L|) nearest neighbour queries you need a search structure on top of the Voronoi diagram. A common approach is to use the Dobkin-Kirkpatrick Hierarchy, details can be found in various lecture notes. This method supports O(log |L|) queries and also requires only O(|L|) size. (Also mentioned in this post.)
Then the |E| queries can be accomplished in O(|E| log |L|) time.
A different approach would be to use k-d trees. From an implementation point of view they might be less work and provide the same complexities (as far as iI know). A quick search revealed these two implementations that are maybe worth testing: C++, Java.