foreach() garbage collection

Posted 2020-07-09 07:46

Question:

I'm using nested foreach loops with the doSMP backend to generate results based on a function I developed. Ordinarily the problem would use three nested loops, but due to the size of the results generated (around 80,000 for each i), I've had to pause accumulating results and write them to file whenever the final results matrix exceeds a specified number of rows.

library(foreach)  # assumes a doSMP backend is already registered

i = 1
write.off = 1

while(i <= length(i.vector)){
        # Placeholder first row so rbind() has something to append to; dropped before writing
        results.frame = as.data.frame(matrix(NA, ncol = 3, nrow = 1))

        # Keep accumulating results for successive i until the frame reaches 500,000 rows
        while(nrow(results.frame) < 500000 && i <= length(i.vector)){
                results = foreach(j = seq_along(j.vector), .combine = "rbind", .inorder = TRUE) %:%
                foreach(k = seq_along(k.vector), .combine = "rbind", .inorder = TRUE) %dopar%{

                        ith.value = i.vector[i]
                        jth.value = j.vector[j]
                        kth.value = k.vector[k]
                        my.function(ith.value, jth.value, kth.value)
                }

                results.frame = rbind(results.frame, results)
                i = i + 1
        }

        # Drop the placeholder row, then write this batch to its own file
        results.frame = results.frame[-1,]
        write.table(results.frame, paste("part_",write.off, sep = ""))
        write.off = write.off + 1
}

The problem I'm having is with garbage collection. The workers don't seem to release memory back to the system, so by i = 4 each of them has consumed around 6 GB of memory.

I've tried inserting gc() into the foreach loop directly as well as into the underlying function, and I've also tried assigning the function and its results to a named environment that I can clear periodically. None of these methods have worked.

I feel like foreach's initEnvir and finalEnvir parameters might offer a solution, but the documentation and examples haven't really shed much light on this.

I'm running this code on a VM operating Windows Server 2008.

Answer 1:

You might consider avoiding this issue altogether by writing a different loop.

Consider using the gen.factorial function in AlgDesign, a la:

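library(AlgDesign)

# Each row of fact1 is one (i, j, k) combination of index levels 1..n
fact1 = gen.factorial(c(length(i.vector), length(j.vector), length(k.vector)), nVars = 3, center = FALSE)
foreach(ix_row = 1:nrow(fact1)) %dopar% {
  # Translate the index levels in this row back into the actual values
  my.function(i.vector[fact1[ix_row, 1]], j.vector[fact1[ix_row, 2]], k.vector[fact1[ix_row, 3]])
}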
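
If you would rather not pull in AlgDesign, the same flattening can be sketched with base R's expand.grid. This is only an illustration under assumptions: the i.vector, j.vector, k.vector and my.function below are hypothetical stand-ins for yours, the chunk size is tiny on purpose, and the lapply() runs sequentially where you would put your foreach() %dopar% call.

```r
# Hypothetical stand-ins for the asker's inputs and function
i.vector <- 1:3
j.vector <- c(10, 20)
k.vector <- c(0.5, 1.5)
my.function <- function(x, y, z) c(x, y, x * y * z)

# One row per (i, j, k) index combination. expand.grid varies its first
# argument fastest, so listing k first keeps i as the slowest index,
# matching the order of the original nested loops.
grid <- expand.grid(k = seq_along(k.vector),
                    j = seq_along(j.vector),
                    i = seq_along(i.vector))

chunk.size <- 5  # rows per output file; tiny here, purely for illustration
chunk.id <- ceiling(seq_len(nrow(grid)) / chunk.size)
out.dir <- tempdir()

part.files <- character(0)
for (part in unique(chunk.id)) {
  rows <- which(chunk.id == part)
  # Evaluate one chunk sequentially; with a registered backend this
  # lapply() could be swapped for a foreach() %dopar% loop over rows
  res <- do.call(rbind, lapply(rows, function(r) {
    my.function(i.vector[grid$i[r]], j.vector[grid$j[r]], k.vector[grid$k[r]])
  }))
  part.file <- file.path(out.dir, paste("part_", part, sep = ""))
  write.table(res, part.file)
  part.files <- c(part.files, part.file)
}
```

Because each chunk is written and then discarded, the master process never holds more than chunk.size rows at once, and there is a single loop to parallelise instead of two nested ones.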

You could also use memory mapped files and pre-allocate the output storage using bigmemory (assuming you're creating a matrix) and that would make it feasible for each worker to store its output on its own.

In this way, your overall memory usage should drop dramatically.
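
A rough sketch of the bigmemory idea, assuming the bigmemory package is installed and that every call to your function returns three numbers. The input vectors, my.function and the write.row helper are hypothetical names used only for illustration, and the final loop runs sequentially where your parallel loop would go.

```r
library(bigmemory)  # assumed to be installed; not part of base R

# Hypothetical stand-ins for the asker's inputs and function
i.vector <- 1:3
j.vector <- c(10, 20)
k.vector <- c(0.5, 1.5)
my.function <- function(x, y, z) c(x, y, x * y * z)

grid <- expand.grid(k = seq_along(k.vector),
                    j = seq_along(j.vector),
                    i = seq_along(i.vector))

# Fresh directory for the backing file so reruns do not collide
backing.dir <- tempfile("bm_")
dir.create(backing.dir)

# Pre-allocate a file-backed matrix: one row per combination, three columns
out <- filebacked.big.matrix(nrow = nrow(grid), ncol = 3, type = "double",
                             backingfile = "results.bin",
                             descriptorfile = "results.desc",
                             backingpath = backing.dir)
desc <- describe(out)  # small descriptor that can be passed to workers

# Hypothetical helper: what each worker would do for row r
write.row <- function(r, desc) {
  m <- attach.big.matrix(desc)  # map the shared file, no copy of the data
  m[r, ] <- my.function(i.vector[grid$i[r]], j.vector[grid$j[r]], k.vector[grid$k[r]])
  invisible(NULL)  # return nothing, so there are no results to combine
}

# Sequential stand-in for the parallel loop
for (r in seq_len(nrow(grid))) write.row(r, desc)
```

The point of the design is that each worker returns NULL rather than a block of rows, so nothing accumulates in the master or gets passed through .combine.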


Update 1: It seems that memory issues are endemic to doSMP. Check out the following posts:

  • An answer by a Revo engineer discusses some memory and process issues
  • Joris Meys reports that doSMP crashes his R instances frequently

I recall seeing another memory issue for doSMP, either as a question or in the R chat, but I can't seem to recover the post.

Update 2: I don't know if this will help, but you might try using an explicit return() (e.g. return(my.function(ith.value, jth.value, kth.value))). In my code, I generally use an explicit return() for clarity.