I have a huge training data set for random forest (dim: 47,600,811 x 9). I want to take multiple (say 1000) bootstrapped samples of dimension 10,000 x 9 (taking 9,000 negative-class and 1,000 positive-class data points in each run), iteratively grow trees for each of them, and then combine all those trees into one forest. A rough idea of the required code is given below. Can anybody guide me on how to generate a random sample with replacement from my actual trainData and grow trees for each sample iteratively in an efficient way? Any help would be greatly appreciated. Thanks
library(doSNOW)
library(randomForest)
cl <- makeCluster(8)
registerDoSNOW(cl)
for (i in 1:1000) {
B <- 1000   # number of positive-class ("B") rows per sample
U <- 9000   # number of negative-class ("U") rows per sample
# draw a balanced bootstrap sample: sample row indices of each class with replacement
dataB <- trainData[sample(which(trainData$class == "B"), B, replace = TRUE), ]
dataU <- trainData[sample(which(trainData$class == "U"), U, replace = TRUE), ]
subset <- rbind(dataB, dataU)
# I am not sure if this is the optimal way of producing a subset again and again (1000 times) from the actual trainData.
# grow 8 blocks of 125 trees each in parallel on this subset
rf <- foreach(ntree = rep(125, 8), .packages = 'randomForest') %dopar% {
randomForest(subset[,-1], subset$class, ntree=ntree)
}
}
crf <- do.call('combine', rf)
print(crf)
stopCluster(cl)
Something like this would work: wrap the sampling and tree-growing step in a function that takes the replicate index, then feed i into a parallel version of lapply.
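For example, a minimal sketch of that idea (not the original answer's code: the helper fitOne, the use of parallel::parLapply, and the choice of 125 trees per sample are assumptions layered on the question's own variable names):

library(parallel)
library(randomForest)

# Hypothetical helper: i is only the replicate index; each call draws a fresh
# balanced bootstrap sample and fits a small forest on it.
fitOne <- function(i, data) {
  B <- 1000
  U <- 9000
  dataB <- data[sample(which(data$class == "B"), B, replace = TRUE), ]
  dataU <- data[sample(which(data$class == "U"), U, replace = TRUE), ]
  subset <- rbind(dataB, dataU)
  randomForest(subset[, -1], subset$class, ntree = 125)
}

cl <- makeCluster(8)
clusterEvalQ(cl, library(randomForest))   # load randomForest on the workers
forests <- parLapply(cl, 1:1000, fitOne, data = trainData)
crf <- do.call("combine", forests)
stopCluster(cl)

Because trainData is passed as an argument it gets serialized to the workers; for a data set this large, exporting it once with clusterExport(cl, "trainData") and referencing it inside the function may be preferable.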
Although your example parallelizes the inner rather than the outer loop, it may work reasonably well as long as the inner foreach loop takes more than a few seconds to execute, which it almost certainly does. However, your program does have a bug: it is throwing away the first 999 foreach results and only processing the last result. To fix this, you could preallocate a list of length 1000*8 and assign the results from foreach into it on each iteration of the outer for loop. For example:
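One way that fix might look (a sketch rather than the original answer's code; the index arithmetic assumes the 8 foreach results per outer iteration shown above, and the sampling lines are copied from the question):

rf <- vector("list", 1000 * 8)   # preallocate one slot per returned forest
for (i in 1:1000) {
  B <- 1000
  U <- 9000
  dataB <- trainData[sample(which(trainData$class == "B"), B, replace = TRUE), ]
  dataU <- trainData[sample(which(trainData$class == "U"), U, replace = TRUE), ]
  subset <- rbind(dataB, dataU)
  # store this iteration's 8 results instead of overwriting the previous ones
  rf[(i - 1) * 8 + 1:8] <-
    foreach(ntree = rep(125, 8), .packages = "randomForest") %dopar% {
      randomForest(subset[, -1], subset$class, ntree = ntree)
    }
}
crf <- do.call("combine", rf)
print(crf)
stopCluster(cl)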
This should fix the problem that you mention in the comment that you directed to me.