Random forest bootstrap training and forest genera

I have a huge training data set for a random forest (dimensions: 47600811*9). I want to take multiple (let's say 1000) bootstrapped samples of dimension 10000*9 (taking 9000 negative-class and 1000 positive-class data points in each run), iteratively generate trees for each of them, and then combine all those trees into one forest. A rough idea of the required code is given below. Can anybody guide me on how to generate random samples with replacement from my actual trainData and optimally generate trees for them iteratively? It would be a great help. Thanks.

library(doSNOW)
library(randomForest)
cl <- makeCluster(8)
registerDoSNOW(cl)

for (i in 1:1000) {
B <- 1000 
U <- 9000 
dataB <- trainData[sample(which(trainData$class == "B"), B,replace=TRUE),] 
dataU <- trainData[sample(which(trainData$class == "U"), U,replace=TRUE),] 
subset <- rbind(dataB, dataU)

I am not sure if it is the optimal way of producing a subset again and again (1000 times) from actual trainData.

rf <- foreach(ntree=rep(125, 8), .packages='randomForest') %dopar% {
  randomForest(subset[,-1], subset$class, ntree=ntree)
}
}
crf <- do.call('combine', rf)
print(crf)
stopCluster(cl)

Tags: r parallel-processing random-forest bootstrapping snow

2 Answers

Fickle 薄情

Reply #2 · 2019-09-19 15:11

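Something like this would work:

# Find the row indices of each class once, up front.
# Replicate the sampling expression 1000 times and keep each result as a list element
# (simplify = FALSE stops replicate() from collapsing the results into a matrix).
# Each replication draws 1000 indices of class B and 9000 of class U, with replacement,
# and concatenates the two index vectors.

idxB <- which(trainData$class == "B")
idxU <- which(trainData$class == "U")
i <- replicate(1000, c(sample(idxB, 1000, replace = TRUE), sample(idxU, 9000, replace = TRUE)), simplify = FALSE)

Then feed i into a parallel version of lapply, such as mclapply from the parallel package (it forks, so using more than one core requires a Unix-like system):

library(parallel)
# ntree = 125 is only an example value; mclapply passes it through to the function
rf <- mclapply(i, function(idx, ntree) randomForest(trainData[idx, -1], trainData$class[idx], ntree = ntree), ntree = 125)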
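The list of forests returned by a parallel lapply still has to be merged into one forest. The following is a minimal sketch of that step on made-up toy data, not a tuned solution: the data frame toy, the sample sizes (10 and 30), the 4 replications, and ntree = 25 are all assumptions chosen so that it runs quickly. It uses combine() from randomForest via do.call.

```r
library(parallel)
library(randomForest)

# Toy data (assumption): class is the first column, so [, -1] leaves only predictors
toy <- data.frame(class = factor(rep(c("B", "U"), each = 50)),
                  a = rnorm(100), b = rnorm(100))
idxB <- which(toy$class == "B")
idxU <- which(toy$class == "U")

# Four stratified bootstrap samples, stored as a list of row-index vectors
idx_list <- replicate(4, c(sample(idxB, 10, replace = TRUE),
                           sample(idxU, 30, replace = TRUE)),
                      simplify = FALSE)

# mclapply forks, so fall back to a single core on Windows
cores <- if (.Platform$OS.type == "windows") 1L else 2L
rf_list <- mclapply(idx_list, function(idx) {
  randomForest(toy[idx, -1], toy$class[idx], ntree = 25)
}, mc.cores = cores)

# Merge the per-sample forests into a single randomForest object
crf <- do.call(randomForest::combine, rf_list)
print(crf$ntree)  # total number of trees across all merged forests
```

The combined object is meant for prediction. According to the randomForest documentation, its confusion and err.rate components are NULL after combine(), and because each forest here saw different rows, the remaining out-of-bag summaries are probably best not relied on.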


孤傲高冷的网名

Reply #3 · 2019-09-19 15:28

Although your example parallelizes the inner rather than the outer loop, it may work reasonably well as long as the inner foreach loop takes more than a few seconds to execute, which it almost certainly does. However, your program does have a bug: it is throwing away the first 999 foreach results and only processing the last result. To fix this, you could preallocate a list of length 1000*8 and assign the results from foreach into it on each iteration of the outer for loop. For example:

library(doSNOW)
library(randomForest)
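# Toy data: class is a factor in the first column, so subset[,-1] drops it from the predictors
trainData <- data.frame(class=factor(c(rep("U", 10), rep("B", 10))),
                        a=rnorm(20), b=rnorm(20))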
n <- 1000         # outer loop count
chunksize <- 125  # value of ntree used in inner loop
nw <- 8           # number of cluster workers
cl <- makeCluster(nw)
registerDoSNOW(cl)
rf <- vector('list', n * nw)
for (i in 1:n) {
  B <- 1000
  U <- 9000
  dataB <- trainData[sample(which(trainData$class == "B"), B,replace=TRUE),]
  dataU <- trainData[sample(which(trainData$class == "U"), U,replace=TRUE),]
  subset <- rbind(dataB, dataU)
  ix <- seq((i-1) * nw + 1, i * nw)
  rf[ix] <- foreach(ntree=rep(chunksize, nw),
                    .packages='randomForest') %dopar% {
    randomForest(subset[,-1], subset$class, ntree=ntree)
  }
}
cat(sprintf("# models: %d; expected # models: %d\n", length(rf), n * nw))
cat(sprintf("expected total # trees: %d\n", n * nw * chunksize))
crf <- do.call('combine', rf)
print(crf)
stopCluster(cl)

This should fix the problem that you mention in the comment that you directed to me.
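For comparison, the outer loop can be parallelized instead, with one task per bootstrap sample and foreach merging the results through its .combine argument. This is only a sketch under stated assumptions: the toy data frame, the sample sizes (10 and 30), the 6 iterations, the 2 workers, and ntree = 25 are invented for illustration, and with the real 47600811-row trainData the cost of exporting the data to every worker would need to be checked.

```r
library(doSNOW)
library(randomForest)

# Toy data (assumption): class is the first column, so [, -1] leaves only predictors
toy <- data.frame(class = factor(rep(c("B", "U"), each = 50)),
                  a = rnorm(100), b = rnorm(100))
idxB <- which(toy$class == "B")
idxU <- which(toy$class == "U")

cl <- makeCluster(2)
registerDoSNOW(cl)

# One task per bootstrap sample; foreach merges the returned forests with combine()
crf <- foreach(i = 1:6, .combine = randomForest::combine, .multicombine = TRUE,
               .packages = "randomForest") %dopar% {
  idx <- c(sample(idxB, 10, replace = TRUE), sample(idxU, 30, replace = TRUE))
  randomForest(toy[idx, -1], toy$class[idx], ntree = 25)
}

stopCluster(cl)
print(crf$ntree)  # total number of trees in the combined forest
```

Because the merging happens inside foreach, no list of intermediate forests has to be preallocated or indexed by hand.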
