I am running R code on a Linux cluster. The code is complex (over two thousand lines), involves over 40 R packages and several hundred variables, but it does run on both the Windows and Linux versions of R.
I am now running the code on the Edinburgh University ECDF high-performance computing cluster, and it runs in parallel. The parallel code is called from within DEoptim, which, after some initialisation, runs a series of functions in parallel; the results are sent back to the DEoptim algorithm as well as being saved as a plot and a data table in my own space. Importantly, the code runs and works!
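The call is shaped roughly like the sketch below; the objective here is a toy stand-in for my real model run, and parallelType/cluster are DEoptim.control options (depending on the DEoptim version, parallelType is given as a number or a string):

library(DEoptim)
library(parallel)

# Toy stand-in for my real objective (the real one runs the hydrology model)
hydro_objective <- function(par) sum((par - c(1, 2))^2)

cl <- makeCluster(8)
result <- DEoptim(
  fn = hydro_objective,
  lower = c(-5, -5), upper = c(5, 5),
  control = DEoptim.control(
    itermax = 50,
    parallelType = 1,   # evaluate the population in parallel
    cluster = cl        # reuse an existing cluster
  )
)
stopCluster(cl)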
The code models the hydrology of a region, and I can set it to simulate historic conditions over any time period I want, from one day to 30 years. For one month run in parallel, results are produced approximately every 70 seconds, and the DEoptim algorithm simply keeps re-running the code with changed input parameters, trying to find the best parameter set.
The code seems to run fine for a number of runs but eventually crashes. Last night it completed over 100 runs with no problem in approximately two hours, but it eventually crashed, and it always eventually crashes, with the error:
Error in unserialize(node$con) : error reading from connection
The system I am logging onto is a 16-core server (16 physical cores), according to:
detectCores()
and I requested 8 slots with 2 GB of memory each. I have tried running this on a 24-core machine with a large memory request (4 slots of 40 GB each), but it still eventually crashes. The same code ran fine for several weeks on a Windows machine, producing thousands of results while running in parallel across 8 logical cores.
So I believe the code is okay, but why is it crashing? Could it be a memory issue? Each time the sequence is called it includes:
rm(list=ls())
gc()
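As far as I understand, though, rm(list = ls()) and gc() only affect the R session they run in, so if they only execute on the master they do nothing for the workers. Something like this, a sketch assuming cl is the cluster object, would clear the workers' environments between runs:

parallel::clusterEvalQ(cl, {
  rm(list = ls())   # clear the worker's global environment
  gc()              # and collect garbage on the worker
})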
Or is it simply a core crashing? I did think at one point that it could be a problem if two cores tried to write to the same data file at the same time, but I removed that code temporarily and it still crashed. Sometimes it crashes after a few minutes, other times after a couple of hours. I have also tried leaving one core free in the parallel code using:
cl <- parallel::makeCluster(parallel::detectCores()-1)
but it still crashed.
Is there any way the code could be modified to reject crashed outputs, i.e. on error, discard the result and carry on?
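Something along these lines is what I have in mind: wrap the objective in tryCatch() so that a failed evaluation returns a huge penalty value and DEoptim simply discards that parameter set (hydro_objective is a placeholder for my real function):

safe_objective <- function(par) {
  tryCatch(
    hydro_objective(par),      # placeholder for my real objective
    error = function(e) 1e10   # huge penalty, so DEoptim rejects this set
  )
}

Though I realise this would only catch R-level errors inside a worker, not a worker process being killed outright (e.g. by the scheduler for exceeding its memory allocation).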
Or is there a way of modifying the code to capture why the error happened in the first place?
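For instance, makeCluster() has an outfile argument that redirects the workers' stdout/stderr, which might at least preserve whatever the workers print before dying:

# Redirect worker stdout/stderr to a file ("" would print to the master console)
cl <- parallel::makeCluster(8, outfile = "worker_debug.log")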
I know there are lots of other serialize(node$con) and unserialize(node$con) error posts but they don't seem to help me.
I'd really appreciate some help.
Thanks.
I had a similar problem running parallel code that depended on several other packages. Try using foreach() with %dopar% and specify the packages your code depends on with the .packages option, which loads them onto each worker. Alternatively, judicious use of require() within the parallelised code may also work.
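A minimal sketch of the pattern (the worker count and per-iteration function are placeholders; list your own package names in .packages):

library(parallel)
library(doParallel)
library(foreach)

cl <- makeCluster(8)
registerDoParallel(cl)

# Toy per-iteration job; replace with your own function
sim_run <- function(i) data.frame(run = i, value = sqrt(i))

# .packages loads the named packages on every worker before the body runs;
# substitute the packages your code actually depends on
results <- foreach(i = 1:100, .combine = rbind,
                   .packages = c("stats", "utils")) %dopar% {
  sim_run(i)
}

stopCluster(cl)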