Recalculating distance matrix

I’ve got a large input matrix (4000x10000). I use dist() to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the "same" matrix with an additional row (for a 4001x10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?

I'll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.

First of all, for euclidean distance of matrices, I'd recommend the rdist function from the fields package. It is written in Fortran and is a lot faster than the dist function. It returns a matrix instead of a dist object, but you can always go from one to the other using as.matrix and as.dist.

Here is (smaller than yours) sample data

num.points <- 400
num.vars   <- 1000
original.points <- matrix(runif(num.points * num.vars),
                          nrow = num.points, ncol = num.vars)

and the distance matrix you already computed:

d0 <- rdist(original.points)

For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution is general to any number of extra points:

extra.points <- matrix(runif(2 * num.vars), nrow = 2)
inner.dist   <- rdist(extra.points)
outer.dist   <- rdist(extra.points, original.points)

so you can bind them to your bigger distance matrix:

d1 <- rbind(cbind(d0, t(outer.dist)),
            cbind(outer.dist, inner.dist))

Let's check that it matches what a full, long rerun would have produced:

d2 <- rdist(rbind(original.points, extra.points))

identical(d1, d2)
# [1] TRUE

Recalculating distance matrix

问题:

回答1:

收藏的人(0)

Recalculating distance matrix

问题:

回答1:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮