Recalculating distance matrix

2019-04-02 21:31发布

问题:

I’ve got a large input matrix (4000x10000). I use dist() to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the "same" matrix with an additional row (for a 4001x10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?

回答1:

I'll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.

First of all, for euclidean distance of matrices, I'd recommend the rdist function from the fields package. It is written in Fortran and is a lot faster than the dist function. It returns a matrix instead of a dist object, but you can always go from one to the other using as.matrix and as.dist.

Here is (smaller than yours) sample data

num.points <- 400
num.vars   <- 1000
original.points <- matrix(runif(num.points * num.vars),
                          nrow = num.points, ncol = num.vars)

and the distance matrix you already computed:

d0 <- rdist(original.points)

For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution is general to any number of extra points:

extra.points <- matrix(runif(2 * num.vars), nrow = 2)
inner.dist   <- rdist(extra.points)
outer.dist   <- rdist(extra.points, original.points)

so you can bind them to your bigger distance matrix:

d1 <- rbind(cbind(d0, t(outer.dist)),
            cbind(outer.dist, inner.dist))

Let's check that it matches what a full, long rerun would have produced:

d2 <- rdist(rbind(original.points, extra.points))

identical(d1, d2)
# [1] TRUE


标签: r distance