Fairly new to parallel R, so a quick question. I have an algorithm that is computationally intensive. Fortunately, it can easily be broken up into pieces to make use of multicore or snow. What I would like to know: is it considered fine in practice to use multicore in conjunction with snow?
What I would like to do is split up my load to run on multiple machines in a cluster and, on each machine, utilize all of its cores. For this type of processing, is it reasonable to mix snow with multicore?
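For concreteness, here is roughly the pattern I have in mind, written against the parallel package (which bundles snow-style clusters and multicore's mclapply). The host names, the chunking and the per-piece computation are made-up placeholders, and mclapply's forking assumes the worker machines are not running Windows:

library(parallel)

# Hypothetical per-machine worker: uses all cores of the machine it runs on
perMachine <- function(chunk) {
  parallel::mclapply(chunk, function(piece) {
    sum(piece)  # placeholder for the real computation on one piece
  }, mc.cores = parallel::detectCores())
}

# One snow-style worker process per machine; each worker fans out over its cores
cl <- makeCluster(c("machine1", "machine2"))
chunks <- split(1:1000, rep_len(1:2, 1000))  # one chunk of work per machine
res <- clusterApply(cl, chunks, perMachine)
stopCluster(cl)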
I have used the approach suggested above by lockedoff, that is, use the parallel package to distribute an embarrassingly parallel workload over multiple machines, each with multiple cores. First the workload is distributed over all machines, and then the workload of each machine is distributed over all its cores. The disadvantage of this approach is that there is no load balancing between machines (at least, I don't know how to do it).
All loaded R code should be identical and in the same location on all machines (e.g. kept in sync via svn). Because initializing the clusters takes quite some time, the code below could be improved by reusing the created clusters.
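A minimal sketch of what that reuse could look like, assuming the functions below were changed to accept an existing cluster as an extra argument (they don't, as written; workloadA and workloadB are placeholders):

library("parallel")
# Create the cluster once ...
cluster = makeCluster(detectCores(), outfile="distributedFoo.log")
# ... reuse it for several workloads (hypothetical extra argument) ...
resultsA = distributedFooOnCores(workloadA, cluster)
resultsB = distributedFooOnCores(workloadB, cluster)
# ... and stop it only after all work is done
stopCluster(cluster)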
foo <- function(workload, otherArgumentsForFoo) {
  source("/home/user/workspace/mycode.R")
  ...
}
distributedFooOnCores <- function(workload) {
  # Somehow assign a batch number to every record
  workload$ParBatchNumber = NA
  # Split the workload into batches according to ParBatchNumber
  batches = by(workload, workload$ParBatchNumber, function(x) x)
  # Create a cluster with one worker per core on this machine
  library("parallel")
  cluster = makeCluster(detectCores(), outfile="distributedFooOnCores.log")
  batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
  stopCluster(cluster)
  # Merge the resulting batches
  results = someEmptyDataframe
  p = 1
  for(i in 1:length(batches)){
    results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
    p = p + nrow(batches[[i]])
  }
  # Clean up
  workload$ParBatchNumber = NULL
  return(invisible(results))
}
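The "somehow assign a batch number" step is left as a placeholder above; one simple generic way to fill it in (just an illustration, your split may need to be domain specific) is to deal the records round-robin over as many batches as there are cores:

# Round-robin batch assignment: as many batches as there are cores
nBatches = detectCores()
workload$ParBatchNumber = rep_len(seq_len(nBatches), nrow(workload))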
distributedFooOnMachines <- function(workload) {
  # Somehow assign a batch number to every record
  workload$DistrBatchNumber = NA
  # Split the workload into batches according to DistrBatchNumber
  batches = by(workload, workload$DistrBatchNumber, function(x) x)
  # Create a cluster with workers on all machines
  library("parallel")
  # If makeCluster hangs, make sure passwordless ssh is configured on all machines
  cluster = makeCluster(c("machine1", "etc"), master="ub2", user="", outfile="distributedFooOnMachines.log")
  batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
  stopCluster(cluster)
  # Merge the resulting batches
  results = someEmptyDataframe
  p = 1
  for(i in 1:length(batches)){
    results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
    p = p + nrow(batches[[i]])
  }
  # Clean up
  workload$DistrBatchNumber = NULL
  return(invisible(results))
}
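One possible refinement I have not tried: the parallel package also provides parLapplyLB, a load-balanced variant of parLapply that hands out the next batch as soon as a worker is free, which might help with the lack of load balancing mentioned above:

# Load-balanced drop-in for the parLapply call above (untested here)
batches = parLapplyLB(cluster, batches, foo, otherArgumentsForFoo)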
I'm interested in how the approach above can be improved further.