Combining Multicore with Snow Cluster

Posted 2019-04-02 04:48

Question:

Fairly new to parallel R. Quick question. I have an algorithm that is computationally intensive. Fortunately, it can easily be broken up into pieces to make use of multicore or snow. What I would like to know is whether it is considered fine in practice to use multicore in conjunction with snow.

What I would like to do is split up my load to run on multiple machines in a cluster, and on each machine utilize all of its cores. For this type of processing, is it reasonable to mix snow with multicore?
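For concreteness, here is a minimal sketch (not from the question itself; the host names and foo are hypothetical placeholders) of what such snow-plus-multicore nesting could look like with the parallel package: an outer PSOCK (snow-style) cluster with one worker per machine, and mclapply() forking over the local cores inside each worker.

library("parallel")

# Hypothetical per-item worker; stands in for the real computation
foo = function(x) x^2

# Toy workload, split into one chunk per machine
workload = 1:1000
chunks = split(workload, rep(1:2, length.out = length(workload)))

# Outer level: a snow-style PSOCK cluster with one worker per machine
outer = makeCluster(c("machine1", "machine2"))
clusterEvalQ(outer, library("parallel"))
clusterExport(outer, "foo")

# Inner level: each machine forks over its own cores (multicore-style, so Unix only)
results = parLapply(outer, chunks, function(chunk) {
    unlist(mclapply(chunk, foo, mc.cores = detectCores()))
})

stopCluster(outer)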

Answer 1:

I have used the approach suggested above by lockedoff, that is, use the parallel package to distribute an embarrassingly parallel workload over multiple machines with multiple cores. First the workload is distributed over all machines, and then the workload of each machine is distributed over all of its cores. The disadvantage of this approach is that there is no load balancing between machines (at least, I don't know how to do it); one possible workaround is sketched below.
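One way to get some load balancing between machines (not part of the original answer, just a hedged sketch) is to create considerably more batches than there are machines and hand them out dynamically with clusterApplyLB() from the parallel package, so that a machine that finishes a batch early simply receives the next one:

library("parallel")

# Hypothetical host names; in practice one entry per machine
cluster = makeCluster(c("machine1", "machine2"))

# Many small batches (far more than workers) so the scheduler has room to balance the load
batches = split(1:1000, rep(1:20, length.out = 1000))

# clusterApplyLB() sends one batch to each worker and dispatches the next batch
# to whichever worker finishes first
results = clusterApplyLB(cluster, batches, function(batch) sum(batch))

stopCluster(cluster)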

All loaded R code should be identical and in the same location on all machines (e.g. kept in sync via svn). Because initializing the clusters takes quite some time, the code below can be improved by reusing the created clusters; a sketch of that idea follows the second function.

# foo processes a single batch of the workload
foo <- function(workload, otherArgumentsForFoo) {
    # Load the shared R code on the worker
    source("/home/user/workspace/mycode.R")
    ...
}

distributedFooOnCores <- function(workload, otherArgumentsForFoo) {
    # Somehow assign a batch number to every record
    workload$ParBatchNumber = NA
    # Split the assigned workload into batches according to ParBatchNumber
    batches = by(workload, workload$ParBatchNumber, function(x) x)

    # Create a cluster with one worker per core on the local machine
    library("parallel")
    cluster = makeCluster(detectCores(), outfile="distributedFooOnCores.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)

    # Merge the resulting batches into one data frame
    results = someEmptyDataframe  # placeholder: a data frame with the same columns as foo's output
    p = 1
    for(i in 1:length(batches)){
        results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
        p = p + nrow(batches[[i]])      
    }

    # Clean up
    workload$ParBatchNumber = NULL
    return(invisible(results))
}
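As an aside (not from the original answer): if foo() returns data frames with identical columns, the manual merge loop can be replaced by a single call, which also avoids having to pre-allocate someEmptyDataframe:

results = do.call(rbind, batches)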

distributedFooOnMachines <- function(workload, otherArgumentsForFoo) {
    # Somehow assign a batch number to every record
    workload$DistrBatchNumber = NA
    # Split the assigned workload into batches according to DistrBatchNumber
    batches = by(workload, workload$DistrBatchNumber, function(x) x)

    # Create a cluster with workers on all machines 
    library("parallel")
    # If makeCluster hangs, please make sure passwordless ssh is configured on all machines
    cluster = makeCluster(c("machine1", "etc"), master="ub2", user="", outfile="distributedFooOnMachines.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)

    # Merge the resulting batches into one data frame
    results = someEmptyDataframe  # placeholder: a data frame with the same columns as foo's output
    p = 1
    for(i in 1:length(batches)){
        results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
        p = p + nrow(batches[[i]])      
    }

    # Clean up
    workload$DistrBatchNumber = NULL
    return(invisible(results))
}
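As mentioned above, starting the clusters is the expensive part, so one obvious improvement is to create a cluster once and reuse it across calls. A hedged sketch of that idea (the function name, workloadA/workloadB and the round-robin batch assignment are hypothetical, not from the answer):

library("parallel")

# Same batching idea as above, but the cluster is created by the caller and passed in,
# so its start-up cost is paid only once
distributedFooReusingCluster <- function(cluster, workload, otherArgumentsForFoo) {
    # Simple round-robin batch assignment: one batch per worker
    workload$ParBatchNumber = rep(seq_along(cluster), length.out = nrow(workload))
    batches = by(workload, workload$ParBatchNumber, function(x) x)
    parLapply(cluster, batches, foo, otherArgumentsForFoo)
}

# Caller: one cluster, many workloads
cluster = makeCluster(detectCores(), outfile = "reusedCluster.log")
resultsA = distributedFooReusingCluster(cluster, workloadA, otherArgumentsForFoo)
resultsB = distributedFooReusingCluster(cluster, workloadB, otherArgumentsForFoo)
stopCluster(cluster)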

I'm interested in how the approach above can be improved.