
R: get list and environment of all variables and f

Posted 2019-07-31 11:51

Question:

I am using foreach for parallel processing, which requires manually passing functions, via a list, to the environments of the worker cores. I want to automate this process and cover all use cases. This is easy for simple functions that use only their own enclosed variables. Complications arise, however, as soon as the functions to be processed in parallel use arguments and variables that are defined in another environment. Consider the following case:

global.variable <- 3

global.function <-function(j){
  res <- j^2
  return(res)
}

compute.in.parallel <-function(i){
  res <- global.function(i+global.variable)
  return(res)
}

pop <- seq(10)

do <- function(pop,fun){
  require(doParallel)
  require(foreach)
  cl <- makeCluster(16)
  registerDoParallel(cl)
  clusterExport(cl,list("global.variable","global.function"),envir=globalenv())
  results <- foreach(i=pop) %dopar% fun(i)
  stopCluster(cl)
  return(results)
}

do(pop,compute.in.parallel)

This works because I manually pass global.variable and global.function to the cores as well (note that compute.in.parallel itself is automatically considered within scope): clusterExport(cl,list("global.variable","global.function"),envir=globalenv())

But I want to do this automatically, which requires building a list of the names of all variables and functions that are used (but not defined, passed, or contained) within compute.in.parallel. How do I do this?

My current workaround is to dump all available variables to the cores:

clusterExport(cl,as.list(unique(c(ls(.GlobalEnv),ls(environment())))),envir=environment())

This is, however, unsatisfactory: I am not considering variables in package namespaces and other hidden environments, and I am generally passing far too many variables to the cores, which creates significant overhead with every parallel run.

Any suggested improvements?

Answer 1:

Just pass all arguments that are needed in do(), rather than using global variables.

compute.in.parallel <- function(i, global.variable, global.function) {
  global.function(i + global.variable)
}

do <- function(pop, fun, ncores = parallel::detectCores() - 1, ...) {
  require(foreach)
  cl <- parallel::makeCluster(ncores)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  doParallel::registerDoParallel(cl)
  foreach(i = pop) %dopar% fun(i, ...)
}

do(seq(10), compute.in.parallel, 
   global.variable = 3, 
   global.function = function(j) j^2)


Answer 2:

The future framework automatically identifies and exports globals by default. The doFuture package provides a generic future backend adaptor for foreach. If you use that, the following works:

do <- function(pop, fun) {
  library("doFuture")
  registerDoFuture()
  cl <- parallel::makeCluster(2)
  old_plan <- plan(cluster, workers = cl)
  on.exit({
    plan(old_plan)
    parallel::stopCluster(cl)
  })

  foreach(i = pop) %dopar% fun(i)
}
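
To get a feel for what automatic identification of globals involves, here is a minimal sketch using codetools::findGlobals (codetools is a recommended package shipped with standard R installations). It is not the future framework's actual implementation, only an approximation for the example from the question. The helper name find.exports is hypothetical, and the sketch assumes the needed objects are bound directly in the function's enclosing environment. It does not recurse into the functions it finds and does not handle package namespaces.

```r
library(codetools)

# Objects from the question
global.variable <- 3
global.function <- function(j) j^2
compute.in.parallel <- function(i) global.function(i + global.variable)

# Hypothetical helper: names of free variables and functions used in `fun`
# that are bound directly in `envir` (by default the environment in which
# `fun` was defined).
find.exports <- function(fun, envir = environment(fun)) {
  # All names that `fun` uses but does not define locally, including
  # base functions such as `+`
  candidates <- codetools::findGlobals(fun, merge = TRUE)
  # inherits = FALSE ignores parent environments, so base and package
  # functions are dropped and only objects defined in `envir` remain
  is.local <- vapply(candidates, exists, logical(1),
                     envir = envir, inherits = FALSE)
  candidates[is.local]
}

exports <- find.exports(compute.in.parallel)
print(exports)
```

The resulting character vector could then be handed to clusterExport(cl, exports, envir = environment(compute.in.parallel)) in place of the hand-written list. Since global.function might itself depend on further globals, a complete solution would have to repeat the search for every function it finds, which is the kind of bookkeeping the future framework takes care of.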