可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

currently I'm using the build in function dist to calculate my distance matrix in R.

dist(featureVector,method="manhattan")

This is currently the bottlneck of the application and therefore the idea was to parallize this task(conceptually this should be possible)

Searching google and this forum did not succeed.

Does anybody has an idea?

回答1:

Here's the structure for one route you could go. It is not faster than just using the dist() function, instead taking many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the function and export the variables to the cluster would probably be longer than just using dist()

library(parallel)

vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

TaxiDistFun <- function(one.vec, whole.matrix) {
    diff.matrix <- t(t(whole.matrix) - one.vec)
    this.row <- apply(diff.matrix, 1, function(x) sum(abs(x)))
    return(this.row)
}

cl <- makeCluster(detectCores())
clusterExport(cl, list("vec.array", "TaxiDistFun"))

system.time(dist.array <- parRapply(cl, vec.array,
                        function(x) TaxiDistFun(x, vec.array)))

stopCluster(cl)

dim(dist.array) <- c(2000, 2000)

回答2:

The R package amap provides robust and parallelized functions for Clustering and Principal Component Analysis. Among these functions, Dist method offers what you are looking for: computes and returns the distance matrix in a parallel manner.

Dist(x, method = "euclidean", nbproc = 8)

The code above compute euclidean distance with 8 threads.

回答3:

You can also use the parDist function of the parallelDist package, which is specifically built for parallelized distance matrix computations. Advantages are that the package is available on Mac OS, Windows and Linux and already supports 39 different distance measures (see parDist).

Performance comparison for manhattan distance (Sys spec: Mac OS; Intel Core i7 with 4 cores @ 2,5 GHz and hyperthreading enabled):

library(parallelDist)
library(amap)
library(wordspace)
library(microbenchmark)

set.seed(123)
x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

microbenchmark(parDist(x, method = "manhattan"),
               Dist(x, method = "manhattan", nbproc = 8),
               dist.matrix(x, method = "manhattan"),
               times = 10)

Unit: milliseconds
                                      expr      min       lq     mean   median       uq      max neval
          parDist(x, method = "manhattan") 210.9478 214.3557 225.5894 221.3705 237.9829 247.0844    10
 Dist(x, method = "manhattan", nbproc = 8) 749.9397 755.7351 797.6349 812.6109 824.4075 844.1090    10
      dist.matrix(x, method = "manhattan") 256.0831 263.3273 279.0864 275.1882 296.3256 311.3821    10

With a larger matrix:

x <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)
microbenchmark(parDist(x, method = "manhattan"),
+                Dist(x, method = "manhattan", nbproc = 8),
+                dist.matrix(x, method = "manhattan"),
+                times = 10)
Unit: seconds
                                      expr       min        lq      mean    median        uq       max neval
          parDist(x, method = "manhattan")  6.298234  6.388501  6.737168  6.894203  6.947981  7.221661    10
 Dist(x, method = "manhattan", nbproc = 8) 22.722947 24.113681 24.326157 24.477034 24.658145 25.301353    10
      dist.matrix(x, method = "manhattan")  7.156861  7.505229  7.544352  7.567980  7.655624  7.800530    10

Further performance comparisons can be found in parallelDist's vignette.

回答4:

I am a windows user looking for an efficient way to compute the distance matrix to use it in a hierarchical clustering (using the function hclust from the "stats" package for example). The function Dist doesn't work in parallel in Windows so I had to look for something different, and I found the "wordspace" package of Stefan Evert which contains the dist.matrix function. You can try this code:

X <- data.frame(replicate(1000,sample(0:1,5000,rep=TRUE)))
system.time(d <- dist(X, method = "manhattan"))
system.time(d2 <- as.dist( dist.matrix(as.matrix(X), method="manhattan") ))

As you can see computing the distance matrix for a dataframe with 1000 binary features and 5000 instances is much faster with dist.matrix

These are the results in my laptop (i7-6500U):

> system.time(d <- dist(X, method = "manhattan"))
   user  system elapsed 
 151.79    0.04  152.59 
> system.time(d2 <- as.dist( dist.matrix(as.matrix(X), method="manhattan") ))
   user  system elapsed 
  19.19    0.22   19.56

This solved my problem. Here you can check the original thread where I found it: http://r.789695.n4.nabble.com/Efficient-distance-calculation-on-big-matrix-td4633598.html

It doesn´t solve it in parallel but is enough in many occasions.

回答5:

I am also working with somewhat large distance matrices and trying to speed-up the computation. Will Benson above is likely to be correct when he says that "the time to start up the function and export the variables to the cluster would probably be longer than just using".

However, I think this applies to distance matrices with small to moderate size. See the example bellow using the functions Dist from the package amap with 10 processors, dist from the package stats, and rdist from package fields, which calls a Fortran function. The first example creates a 400 x 400 distance matrix. The second creates a 3103 x 3103 distance matrix.

require(sp)
require(fields)
require(amap)
data(meuse.grid)
meuse.gridA <- meuse.grid[1:400, 1:2]
meuse.gridB <- meuse.grid[, 1:2]

# small distance matrix
a <- Sys.time()
invisible(dist(meuse.gridA, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.002138376 secs
a <- Sys.time()
invisible(Dist(meuse.gridA, nbproc = 10, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.005409241 secs
a <- Sys.time()
invisible(rdist(meuse.gridA))
Sys.time() - a
Time difference of 0.02312016 secs

# large distance matrix
a <- Sys.time()
invisible(dist(meuse.gridB, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.09845328 secs
a <- Sys.time()
invisible(Dist(meuse.gridB, nbproc = 10, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.05900002 secs
a <- Sys.time()
invisible(rdist(meuse.gridB))
Sys.time() - a
Time difference of 0.8928168 secs

Note how the computation time reduced from 0.09845328 secs to 0.05900002 secs using Dist compared to dist when the distance matrix was large (3103 x 3103). As such, I would suggest that you use function Dist from the amap package provided you have several processors available.

回答6:

I've found parallelDist to be orders of magnitude faster than dist, and chewing up much less virtual memory in the process, on my Mac under Microsoft R Open 3.4.0. A word of warning though - I've had no luck compiling it on R 3.3.3. It doesn't list the version of R as a dependency but I suspect it is.