MPI parallelization using SNOW is slow

My foray into parallelization continues. I initially had difficulty installing Rmpi, but I got that going (I needed to sudo apt-get it). I should say that I'm running a machine with Ubuntu 10.10.

I ran the same simulation as my previous question. Recall the system times for the unclustered and SNOW SOCK cluster respectively:

> system.time(CltSim(nSims=10000, size=100))
   user  system elapsed 
  0.476   0.008   0.484 
> system.time(ParCltSim(cluster=cl, nSims=10000, size=100))
   user  system elapsed 
  0.028   0.004   0.375

Now, using an MPI cluster, I get a speed slowdown relative to not clustering:

> stopCluster(cl)
> cl <- getMPIcluster()
> system.time(ParCltSim(cluster=cl, nSims=10000, size=100))
   user  system elapsed 
  0.088   0.196   0.604

Not sure whether this is useful, but here's info on the cluster created:

> cl
[[1]]
$rank
[1] 1

$RECVTAG
[1] 33

$SENDTAG
[1] 22

$comm
[1] 1

attr(,"class")
[1] "MPInode"

[[2]]
$rank
[1] 2

$RECVTAG
[1] 33

$SENDTAG
[1] 22

$comm
[1] 1

attr(,"class")
[1] "MPInode"

attr(,"class")
[1] "spawnedMPIcluster" "MPIcluster"        "cluster"

Any idea about what could be going on here? Thanks for your assistance as I try out these parallelization options.

Cordially, Charlie

It's a bit the same as with your other question : the communication between the nodes in the cluster is taking up more time than the actual function.

This can be illustrated by changing your functions :

library(snow)
cl <- makeCluster(2)

SnowSim <- function(cluster, nSims=10,n){
  parSapply(cluster, 1:nSims, function(x){
    Sys.sleep(n)
    x
  })
}



library(foreach)
library(doSNOW)
registerDoSNOW(cl)

ForSim <- function(nSims=10,n) {
  foreach(i=1:nSims, .combine=c) %dopar% {
      Sys.sleep(n)
      i
  }
}

This way we can simulate a long-calculating and a short-calculating function in different numbers of simulations. Let's take two cases, one where you have 1 sec calculation and 10 loops, and one with 1ms calculation and 10000 loops. Both should last 10sec :

> system.time(SnowSim(cl,10,1))
   user  system elapsed 
      0       0       5 

> system.time(ForSim(10,1))
   user  system elapsed 
   0.03    0.00    5.03 

> system.time(SnowSim(cl,10000,0.001))
   user  system elapsed 
   0.02    0.00    9.78 

> system.time(ForSim(10000,0.001))
   user  system elapsed 
  10.04    0.00   19.81

Basically what you see, is that for long-calculating functions and low simulations, the parallelized versions cleanly cut the calculation time in half as expected.

Now the simulations you do are of the second case. There you see that the snow solution doesn't really make a difference any more, and that the foreach solution even needs twice as much. This is simply due to the overhead of communication to and between nodes, and handling of the data that gets returned. The overhead of foreach is a lot bigger than with snow, as shown in my answer to your previous question.

I didn't fire up my Ubuntu to try with an MPI cluster, but it's basically the same story. There are subtle differences between the different types of clusters according to time needed for communication, partly due to differences between the underlying packages.