Currently I'm using the built-in function dist to calculate my distance matrix in R.
dist(featureVector,method="manhattan")
This is currently the bottleneck of the application, so the idea was to parallelize this task (conceptually this should be possible).
Searching Google and this forum did not turn up anything.
Does anybody have an idea?
Here's the structure for one route you could go. It is not faster than just using the dist() function; in fact it takes many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the function and export the variables to the cluster would probably be longer than just using dist().
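The code for that route isn't reproduced here; a minimal sketch of the idea, using the base parallel package to compute the manhattan distances one block of rows at a time (the helper manhattan_block and the block splitting are my own illustration, assuming featureVector is a numeric matrix with one observation per row), could look like this:

library(parallel)

# start one worker per core (minus one) and ship the data to the workers
cl <- makeCluster(max(1, detectCores() - 1))
clusterExport(cl, "featureVector")

# manhattan distances from a block of rows to every row of featureVector
manhattan_block <- function(rows) {
  t(apply(featureVector[rows, , drop = FALSE], 1,
          function(x) colSums(abs(t(featureVector) - x))))
}

# split the row indices into one block per worker and bind the results back together
n <- nrow(featureVector)
blocks <- split(seq_len(n), cut(seq_len(n), length(cl), labels = FALSE))
res <- do.call(rbind, parLapply(cl, blocks, manhattan_block))
stopCluster(cl)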
The R package amap provides robust and parallelized functions for clustering and principal component analysis. Among these functions, the Dist function offers what you are looking for: it computes and returns the distance matrix in a parallel manner.
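The original snippet isn't shown here; going by the description, a call along these lines (using the question's featureVector; nbproc sets the number of threads) would do it:

library(amap)

# parallel distance matrix computed with 8 threads
d <- Dist(featureVector, method = "euclidean", nbproc = 8)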
The code above computes the euclidean distance with 8 threads.
I am a Windows user looking for an efficient way to compute the distance matrix and use it in hierarchical clustering (with the function hclust from the "stats" package, for example). The function Dist doesn't work in parallel on Windows, so I had to look for something different, and I found the "wordspace" package by Stefan Evert, which contains the dist.matrix function. On my laptop (i7-6500U), computing the distance matrix for a data frame with 1000 binary features and 5000 instances was much faster with dist.matrix than with dist.
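The original code and timings aren't reproduced here; a comparison along these lines (the simulated binary data and the choice of euclidean distance are illustrative) matches the setup described:

library(wordspace)

# illustrative data: 5000 instances with 1000 binary features
set.seed(1)
x <- matrix(rbinom(5000 * 1000, 1, 0.5), nrow = 5000)

system.time(d1 <- dist(x, method = "euclidean"))         # stats::dist
system.time(d2 <- dist.matrix(x, method = "euclidean"))  # wordspace::dist.matrix

# dist.matrix returns a full matrix; wrap it in as.dist() before passing it to hclust()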
This solved my problem. Here you can check the original thread where I found it: http://r.789695.n4.nabble.com/Efficient-distance-calculation-on-big-matrix-td4633598.html
It doesn't solve it in parallel, but it is enough on many occasions.
I've found parallelDist to be orders of magnitude faster than dist, and it chews up much less virtual memory in the process, on my Mac under Microsoft R Open 3.4.0. A word of warning though - I've had no luck compiling it on R 3.3.3. It doesn't list a minimum R version as a dependency, but I suspect there is one.
I am also working with somewhat large distance matrices and trying to speed up the computation. Will Benson above is likely to be correct when he says that "the time to start up the function and export the variables to the cluster would probably be longer than just using dist()".
However, I think this applies to distance matrices of small to moderate size. See the example below, which uses the function Dist from the package amap with 10 processors, dist from the package stats, and rdist from the package fields, which calls a Fortran function. The first example creates a 400 x 400 distance matrix; the second creates a 3103 x 3103 distance matrix. Note how the computation time was reduced from 0.09845328 secs to 0.05900002 secs using Dist compared to dist when the distance matrix was large (3103 x 3103). As such, I would suggest that you use the function Dist from the amap package, provided you have several processors available.
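The example code isn't reproduced here; a sketch along these lines (the random 10-column data is illustrative, while the 400 and 3103 row counts come from the text) reflects the two comparisons:

library(amap)    # Dist
library(fields)  # rdist

# first example: 400 x 400 distance matrix
x_small <- matrix(rnorm(400 * 10), nrow = 400)
system.time(Dist(x_small, method = "euclidean", nbproc = 10))  # amap, 10 processors
system.time(dist(x_small, method = "euclidean"))               # stats
system.time(rdist(x_small))                                    # fields (Fortran backend)

# second example: 3103 x 3103 distance matrix
x_large <- matrix(rnorm(3103 * 10), nrow = 3103)
system.time(Dist(x_large, method = "euclidean", nbproc = 10))
system.time(dist(x_large, method = "euclidean"))
system.time(rdist(x_large))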
You can also use the parDist function of the parallelDist package, which is specifically built for parallelized distance matrix computations. Advantages are that the package is available on Mac OS, Windows and Linux and already supports 39 different distance measures (see parDist). I ran a performance comparison for the manhattan distance, first on a smaller and then on a larger matrix (system spec: Mac OS; Intel Core i7 with 4 cores @ 2.5 GHz and hyperthreading enabled).
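The benchmark code and timing output aren't shown here; basic usage for the manhattan case (the matrix below is illustrative) looks like this:

library(parallelDist)

x <- matrix(rnorm(5000 * 50), nrow = 5000)

# parallelized manhattan distance matrix; by default parDist uses all available cores
d_par <- parDist(x, method = "manhattan")

# serial base R equivalent for comparison
d_base <- dist(x, method = "manhattan")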
Further performance comparisons can be found in parallelDist's vignette.