Future solutions

Posted 2019-07-28 16:56

Question:

I am working with a large data set, which I use for certain calculations. Since the data set is huge, the machine I am working on takes an excessively long time to do the job, so I decided to use the future package to distribute the work across several machines and speed up the calculations. My problem is that through future (using PuTTY & SSH) I can connect to those machines (in parallel), but the work itself is done by the main machine, without any distribution. Maybe you can advise some solution:

  • How to make it work on all machines;
  • Also, how to check whether the processes are actually running on the workers (I mean some function or anything that could help to verify this, if such a thing exists).

My code:

library(future)
workers <- c("000.000.0.000", "111.111.1.111")
plan(remote, envir = parent.frame(), workers = workers, myip = "222.222.2.22")
start <- proc.time()
cl <- makeClusterPSOCK(
  c("000.000.0.000", "111.111.1.111"), user = "...",
  rshcmd = c("plink", "-ssh", "-pw", "..."),
  rshopts = c("-i", "V:\\vbulavina\\privatekey.ppk"),
  homogeneous = FALSE)
setwd("V:/vbulavina/r/inversion")
a <- source("fun.r")
f <- future({source("pasos.r")})
l <- future({source("pasos2.R")})
time_elapsed_parallel <- proc.time() - start
time_elapsed_parallel

The f and l objects are supposed to be computed in parallel, but the master machine is doing all the work, so I'm a bit confused about whether I can do anything about it.

PS: I tried plan() with remote, multiprocess, multisession, and cluster, and nothing worked.

PS2: my local machine runs Windows, and I am trying to connect to machines running Kubuntu and Debian (the firewall is off on all of them).

Thanks in advance.

Answer 1:

Author of future here. First, make sure you can set up the PSOCK cluster, i.e. connect to the two workers over SSH and run Rscript on them. You do this as:

library(future)
workers <- c("000.000.0.000", "111.111.1.111")
cl <- makeClusterPSOCK(workers, user = "...",
                       rshcmd = c("plink", "-ssh", "-pw",  "..."),
                       rshopts = c("-i", "V:/vbulavina/privatekey.ppk"),
                       homogeneous = FALSE)
print(cl)
### socket cluster with 2 nodes on hosts '000.000.0.000', '111.111.1.111'

(If the above makeClusterPSOCK() call stalls or doesn't work, add the argument verbose = TRUE to get more info - feel free to report back here.)
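For example, here is a sketch (reusing the arguments from above) that reruns the setup with verbose = TRUE, followed by a quick liveness check via the parallel package, with which PSOCK clusters created by makeClusterPSOCK() are compatible:

cl <- makeClusterPSOCK(workers, user = "...",
                       rshcmd = c("plink", "-ssh", "-pw", "..."),
                       rshopts = c("-i", "V:/vbulavina/privatekey.ppk"),
                       homogeneous = FALSE,
                       verbose = TRUE)  ## prints the SSH/plink commands and
                                        ## the Rscript startup on each worker

## Quick sanity check: ask each worker for its hostname.
parallel::clusterCall(cl, function() Sys.info()[["nodename"]])
### should list the two worker hostnames, not the master's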

Next, with the PSOCK cluster set up, tell the future system to parallelize over those two workers:

plan(cluster, workers = cl)

Test that futures are actually resolved remotely, e.g.

f <- future(Sys.info()[["nodename"]])
print(value(f))
### [1] "000.000.0.000"

I leave the remaining part, which also needs adjustments, for now - let's make sure to get the workers up and running first.

Continuing, using source() in parallel processing complicates things, especially when the parallelization is done on different machines. For instance, calling source("my_file.R") on another machine requires that the file my_file.R is available on that machine too. Even if it is, it also complicates things when it comes to the automatic identification of variables that need to be exported to the external machine. A safer approach is to incorporate all the code in the main script. Having said all this, you can try to replace:

f <- future({source("pasos.r")})
l <- future({source("pasos2.R")})

with

futureSource <- function(file, envir = parent.frame(), ...) {
  expr <- parse(file)
  future(expr, substitute = FALSE, envir = envir, ...)
}

f <- futureSource("pasos.r")
l <- futureSource("pasos2.R")

As long as pasos.r and pasos2.R don't call source() internally, this could/should work.
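Alternatively, here is a minimal sketch of the "incorporate all the code in the main script" approach mentioned above. The function names pasos() and pasos2() and their bodies are hypothetical placeholders for whatever the two scripts currently do:

## Hypothetical placeholders: move the contents of pasos.r and pasos2.R
## into functions defined in the main script, so that future() can see
## (and automatically export) everything they need.
pasos <- function() {
  ## ... the heavy calculation currently in pasos.r ...
  Sys.info()[["nodename"]]  ## placeholder return value
}
pasos2 <- function() {
  ## ... the heavy calculation currently in pasos2.R ...
  Sys.info()[["nodename"]]
}

f <- future(pasos())
l <- future(pasos2())
value(f)
value(l)

With plan(cluster, workers = cl) set as above, the two futures should then be evaluated on the two remote workers rather than on the master.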

BTW, what version of Windows are you on? Because with an up-to-date Windows 10, you have built-in support for SSH and you no longer need to use PuTTY.
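In that case (a sketch only, assuming the built-in OpenSSH client is available on the PATH and key-based authentication is set up with an OpenSSH-format private key rather than a .ppk file), the plink-specific arguments can typically be dropped so that makeClusterPSOCK() falls back to the system ssh client:

cl <- makeClusterPSOCK(workers, user = "...",
                       ## no plink-specific rshcmd needed with OpenSSH
                       rshopts = c("-i", "V:/vbulavina/id_rsa"),  ## hypothetical
                                        ## path to an OpenSSH private key
                       homogeneous = FALSE)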

UPDATE 2018-07-31: Continued the answer regarding using source() in futures.