I have two sets of points, called path and centers. For each point in path, I would like an efficient method for finding the ID of the closest point in centers. I would like to do this in R. Below is a simple reproducible example.
set.seed(1)
n <- 10000
x <- 100*cumprod(1 + rnorm(n, 0.0001, 0.002))
y <- 50*cumprod(1 + rnorm(n, 0.0001, 0.002))
path <- data.frame(cbind(x=x, y=y))
centers <- expand.grid(x=seq(0, 500, by=0.5) + rnorm(1001),
                       y=seq(0, 500, by=0.2) + rnorm(2501))
centers$id <- seq(nrow(centers))
x and y are coordinates. I would like to add a column to the path data.frame that holds the id of the closest center for each x and y coordinate. I then want to get all of the unique ids.
My solution at the moment does work, but is very slow when the scale of the problem increases. I would like something much more efficient.
path$closest.id <- sapply(seq(nrow(path)), function(z){
  tmp <- ((centers$x - path[z, 'x'])^2) + ((centers$y - path[z, 'y'])^2)
  as.numeric(centers[tmp == min(tmp), 'id'])
})
output <- unique(path$closest.id)
Any help on speeding this up would be greatly appreciated.
I think data.table might help, but ideally what I am looking for is an algorithm that is a bit smarter in terms of the search, i.e. one that does not calculate the distance to every center and then select only the minimum one to get the id.
I am also happy to use Rcpp/Rcpp11 if that would help improve performance.
The longest acceptable run time for this kind of calculation would be 10 seconds, but obviously faster would be better.
You can do this with nn2 from the RANN package. On my system, this computes the nearest center to each of your path points in under 2 seconds.
Here's another example with 2.5 million candidate points that all fall within the extent of the path points (in your example, the centers have a much larger x and y range than do the path points). It's a little slower in this case.
This can be compared to the output using sp::spDistsN1 (which is much slower for this problem):
Adding the point id to the path data.frame and reducing to unique values is trivial:
This solution reduces processing time for the sample dataset by almost half that achieved by the RANN solution.
It can be installed using
devtools::install_github("thell/Rcppnanoflann")
The Rcppnanoflann solution takes advantage of Rcpp, RcppEigen and the nanoflann EigenMatrixAdaptor, along with C++11, to yield unique indexes identical to those of the original question.
* using path and centers values as defined in the original question
To achieve results identical to the original sample, the RANN solution needs a slight modification, which we time here...
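As a rough sketch of what a RANN-based lookup returning center ids could look like (not necessarily the modification timed here), the example below assumes the RANN package is installed. The helper closest_center_id and the small made-up data set are hypothetical names introduced only for illustration; the data are kept small so the result can be checked against the brute-force approach from the question.

```r
library(RANN)  # assumption: installed via install.packages("RANN")

# Hypothetical helper: id of the nearest center for every row of `path`.
closest_center_id <- function(path, centers) {
  # nn2() builds a k-d tree on `data` and returns, for each row of `query`,
  # the row index of its k nearest neighbours in `data` (here k = 1).
  nn <- nn2(data  = as.matrix(centers[, c("x", "y")]),
            query = as.matrix(path[, c("x", "y")]),
            k = 1)
  # Translate row indices of `centers` into the ids stored in centers$id.
  centers$id[nn$nn.idx[, 1]]
}

# Small made-up data (not the data from the question).
set.seed(1)
centers_small <- expand.grid(x = seq(0, 10, by = 0.5),
                             y = seq(0, 10, by = 0.5))
centers_small$id <- seq_len(nrow(centers_small))
path_small <- data.frame(x = runif(50, 0, 10), y = runif(50, 0, 10))

path_small$closest.id <- closest_center_id(path_small, centers_small)
output_small <- unique(path_small$closest.id)

# Brute-force reference: squared distance from each path point to every center.
brute <- vapply(seq_len(nrow(path_small)), function(i) {
  d2 <- (centers_small$x - path_small$x[i])^2 +
        (centers_small$y - path_small$y[i])^2
  centers_small$id[which.min(d2)]
}, integer(1))
```

On the question's data, the same helper would be called as closest_center_id(path, centers). Looking the id up through centers$id, rather than using the returned row index directly, keeps the result correct even if the ids are not simply 1..nrow(centers).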
The working function of Rcppnanoflann takes advantage of Eigen's Map capabilities to create the input for a fixed-type Eigen matrix from the given P dataframe.
Testing was done with the RcppParallel package, but the kd_tree object does not have a copy constructor, so the tree needed to be created for each thread, which ate up any gains in the parallel query processing.
RcppEigen and Rcpp11 currently don't play together so the idea of using Rcpp11's parallel sapply for the query isn't easily tested.
Here is an Rcpp11 solution. Something similar might work with Rcpp with a few changes.
I get:
This takes advantage of the automatic parallelization of sugar, i.e. sapply is run in parallel. The #define RCPP11_PARALLEL_MINIMUM_SIZE 1000 part is there to force parallel execution, which by default only kicks in from 10000. But in this case, since the inner computations are time consuming, it's worth it.
Note that you need a development version of Rcpp11 because unique is broken in the released version.