Efficient algorithm for finding closest point in a

2019-03-02 02:01发布

I have a list L of ~30k locations (written as longitude/latitude pairs), and a list E of ~1m events (with locations written as longitude/latitude pairs), each of which occurs at a point in L. I want to tag each event in E with its corresponding location in L. But the cooordinates in L and E are rounded differently—E to five decimal places, L to thirteen—so ostensibly identical coordinates can actually differ by ~10^-5 degrees, or ~1 meter. (Points in L are separated by at least ~10 m.)

I thus need the nearest point in L to every point of E; the obvious O(|L||E|) brute-force algorithm is too slow. L is small enough compared to E that algorithms that preprocess L and amortize the preprocessing time over E are fine. Is this a well-studied problem? The links that I can find are for related but distinct problems, like finding the minimal distance between a pair of points in one set.

Possibly relevant: Voronoi diagrams, though I can't see how preprocessing L into a Voronoi diagram would save me computational time.

4条回答
爷的心禁止访问
2楼-- · 2019-03-02 02:25

This is amenable to a fairly direct application of space-partitioning. Most GIS libraries should provide tools to do this.

My own experience is restricted to using R-Trees. I create an index of the L items. You can use the points directly or a bounding box representing the uncertainty around the point. Then the R-tree supports an efficient (log(L) time) query of the n nearest neighbouring points/bounding boxes.

The implementation I have used successfully is libspatialindex wrapped into a python library called Rtree.

However, note that this specific tool is accurate only when using Euclidean x,y coordinates. It has significant potential for error using lat longs away from the equator, and particularly if you want to cover a geographically large area. In my case I was restricted to a single country within which I used eastings/northings. I'm not aware of which libraries provide support for handling the problem using great circle distances but they certainly ought to be available.

查看更多
时光不老,我们不散
3楼-- · 2019-03-02 02:28

From your description:

  • Each point in E is identical to a point in L rounded to five decimal places. Thus, each point in E deviates up to ~1 meter from its match in L.
  • Points in L are separated by at least ~10 meter.

Solution: Round coordinates in L to the same precision of E and match equal pairs.

Explanation: Rounding effectively maps each point to its nearest neighbor on a square grid. As long as the grid resolution (~1 meter when rounding to five decimals) is lower than half the minimal distance between two points in L (~10/2 meters), you are good to go.

For maximum performance and exploiting |L| << |E|, you might want to build a hash map over the rounded coordinates in L and match each point in E in near constant time. Total runtime would then be limited by O(|E|).

查看更多
女痞
4楼-- · 2019-03-02 02:40

My library GeographicLib contains a class NearestNeighbor which implements the vantage-point trees. This works for any data where there's a true distance metric; the geodesic distance obviously satisfies this criterion. Basically you initialize the class with your L locations and then for each of E events you ask for the nearest location. The breakdown on the computational costs is as follows

set-up cost: L log L
cost per query: log L
cost of all queries: E log L
total cost: (L + E) log L

(These are base-2 logs.) Here's code that will do the lookup. Running this on 30k locations and 1M events takes about 40 secs and involved a total of 16M geodesic distance calculations. (The brute force way would take about 21 hours.)

// Read lat/lon locations from locations.txt and lat/lon events from
// events.txt.  For each event print to closest.txt: the index for the
// closest location and the distance to it.

// Needs GeographicLib 1.47 or later

// compile and link with
// g++ -o closest closest.cpp -lGeographic \
//   -L/usr/local/lib -Wl,-rpath=/usr/local/lib

#include <iostream>
#include <vector>
#include <fstream>
#include <GeographicLib/NearestNeighbor.hpp>
#include <GeographicLib/Geodesic.hpp>

using namespace std;
using namespace GeographicLib;

// A structure to hold a geographic coordinate.
struct pos {
  double lat, lon;
  pos(double lat = 0, double lon = 0)
    : lat(lat), lon(lon) {}
};

// A class to compute the distance between 2 positions.
class DistanceCalculator {
private:
  Geodesic _geod;
public:
  explicit DistanceCalculator(const Geodesic& geod)
    : _geod(geod) {}
  double operator() (const pos& a, const pos& b) const {
    double s12;
    _geod.Inverse(a.lat, a.lon, b.lat, b.lon, s12);
    return s12;
  }
};

int main() {
  // Define a distance function object
  DistanceCalculator distance(Geodesic::WGS84());

  // Read in locations
  vector<pos> locs;
  {
    double lat, lon;
    ifstream is("locations.txt");
    while (is >> lat >> lon)
      locs.push_back(pos(lat, lon));
  }

  // Construct NearestNeighbor object
  NearestNeighbor<double, pos, DistanceCalculator>
    locationset(locs, distance);

  ifstream is("events.txt");
  ofstream os("closest.txt");

  double lat, lon, d;
  vector<int> k;
  while (is >> lat >> lon) {
    pos event(lat, lon);
    d = locationset.Search(locs, distance, event, k);
    os << k[0] << " " << d << "\n";
  }
}
查看更多
冷血范
5楼-- · 2019-03-02 02:47

Yes you are correct. First you can construct the Voronoi Diagram of your point set of locations L in O(|L| log |L|) time using Furtune's Sweep Line approach. There are various implementations out there that can be used, Triangle would be among the most common ones.

Now you have a partitioning of the plane of O(|L|) size. To allow O(log |L|) nearest neighbour queries you need a search structure on top of the Voronoi diagram. A common approach is to use the Dobkin-Kirkpatrick Hierarchy, details can be found in various lecture notes. This method supports O(log |L|) queries and also requires only O(|L|) size. (Also mentioned in this post.)

Then the |E| queries can be accomplished in O(|E| log |L|) time.

A different approach would be to use k-d trees. From an implementation point of view they might be less work and provide the same complexities (as far as iI know). A quick search revealed these two implementations that are maybe worth testing: C++, Java.

查看更多
登录 后发表回答