Error occurring in caret when running on a cluster

Posted 2019-05-03 01:28

Question:

I am running the train function in caret on a cluster via doRedis. For the most part it works, but every so often I get errors like these at the very end of the run:

error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>

and

Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : 
  attempt to set an attribute on NULL

When I run traceback(), I get:

5: nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, 
       ppOpts = preProcess, ctrl = trControl, lev = classLevels, 
       ...)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(couple ~ ., training.balanced, method = "nnet", 
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)
1: caret::train(couple ~ ., training.balanced, method = "nnet", 
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)

These errors are not easily reproducible (i.e. they happen sometimes, but not consistently) and only occur at the end of the run. The stdout on the cluster shows all tasks running and completed, so I am a bit flummoxed.

Has anyone encountered these errors? If so, do you understand the cause or, even better, know of a fix?

Answer 1:

I imagine you've already solved this problem, but I ran into the same issue on my cluster, which consists of Linux and Windows systems. I was running the server on Ubuntu 14.04 and had noticed warnings, when starting the server service, about 'transparent huge pages' being enabled in the Linux kernel. I ignored that message and began running training exercises in which most of the machines were maxed out with workers. I received the same error at the end of the run:

error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>

After a lot of head scratching and useless tinkering, I decided to address that warning by following the instructions here: http://ubuntuforums.org/showthread.php?t=2255151

Essentially, I installed hugeadm (on Ubuntu 14.04 it ships in the hugepages package) using:

sudo apt-get install hugepages

Then I disabled transparent huge pages (this needs root) using:

sudo hugeadm --thp-never

Note that this change will be undone on restart of the computer.
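
To confirm that the setting actually took effect, the active mode can be read back from sysfs. The following is a minimal sketch, not part of the original fix: it assumes the kernel exposes the status at /sys/kernel/mm/transparent_hugepage/enabled (the usual location on mainline kernels), and thp_mode is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Assumed location of the THP status file; override THP_FILE if your kernel differs.
THP_FILE="${THP_FILE:-/sys/kernel/mm/transparent_hugepage/enabled}"

# thp_mode (hypothetical helper): print the active THP mode.
# The file lists every mode and brackets the active one,
# e.g. "always madvise [never]".
thp_mode() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${1:-$THP_FILE}"
}

if [ -r "$THP_FILE" ]; then
    echo "THP mode: $(thp_mode)"
else
    echo "THP status file not found: $THP_FILE"
fi

# Alternative to hugeadm, also lost on reboot (needs root):
#   echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

If the script reports "never" on the machine running the server, the change is in place. Since the setting does not survive a reboot, it is worth re-checking after a restart and before launching a long training run.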

When I re-ran my training process it ran without any errors.

Hope that helps.

Cheers, Eric