Memory-efficient prediction with randomForest in R

Posted 2019-07-05 04:04

Question:

TL;DR: I want to know memory-efficient ways of performing batch prediction with randomForest models built on large datasets (hundreds of features, tens of thousands of rows).

Details:

I'm working with a large dataset (over 3 GB in memory) and want to do a simple binary classification using randomForest. Since my data is proprietary, I cannot share it, but let's say the following code runs:

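library(randomForest)
library(data.table)

# Labeled training data: one "response" column plus the feature columns
myData <- fread("largeDataset.tsv")
myFeatures <- myData[, !c("response"), with = FALSE]
# randomForest only performs classification when the response is a factor
myResponse <- as.factor(myData[["response"]])

# Unlabeled rows to predict
toBePredicted <- fread("unlabeledData.tsv")

rfObj <- randomForest(x = myFeatures, y = myResponse, ntree = 100L)

predictedLabels <- predict(rfObj, toBePredicted)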

However, fitting and predicting this way takes several GB of memory.
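To put a number on "several GB", one option is to reset R's memory counters, run the fit and the prediction, and then read the peak back from gc(). The sketch below is only an illustration under assumptions: the small simulated trainX, trainY and newX are hypothetical stand-ins for the proprietary data, and gc() reports memory managed by R in the current process only.

```r
library(randomForest)

set.seed(1)
# Hypothetical stand-in data (not the original data set).
trainX <- data.frame(matrix(rnorm(500 * 10), ncol = 10))
trainY <- factor(ifelse(trainX$X1 + trainX$X2 > 0, "yes", "no"))
newX   <- data.frame(matrix(rnorm(100 * 10), ncol = 10))

# Reset the "max used" counters so the peak reflects only the code below.
invisible(gc(reset = TRUE))

rfObj <- randomForest(x = trainX, y = trainY, ntree = 100L)
predictedLabels <- predict(rfObj, newX)

memStats <- gc()
# The last column of the gc() matrix is the maximum memory used, in Mb,
# with one row for Ncells and one for Vcells.
peakMb <- sum(memStats[, ncol(memStats)])
# Size of the fitted object itself (dominated by the stored trees).
forestSize <- format(object.size(rfObj), units = "MB")

cat("Peak R memory since reset (Mb):", peakMb, "\n")
cat("Size of fitted object:", forestSize, "\n")
```

Running the same measurement around each variant of the call makes it possible to compare them on equal terms.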

I know that I can save memory by turning off the proximity and importance measures and the keep.* arguments, and by passing the unlabeled data as xtest so that predictions are made while the forest is grown:

rfObjWithPreds <- randomForest(x = myFeatures,
                               y = myResponse,
                               proximity = FALSE,
                               localImp = FALSE,
                               importance = FALSE,
                               ntree = 100L,
                               keep.forest = FALSE,
                               keep.inbag = FALSE,
                               xtest = toBePredicted)
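With keep.forest = FALSE there is no forest left to hand to predict(), so the predictions have to be read from the test component of the fitted object. A minimal sketch, assuming small simulated data (trainX, trainY and newX are hypothetical stand-ins, not the original data):

```r
library(randomForest)

set.seed(1)
# Hypothetical stand-in data: 200 labeled rows, 50 unlabeled rows, 5 features.
trainX <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
trainY <- factor(ifelse(trainX$X1 + trainX$X2 > 0, "yes", "no"))
newX   <- data.frame(matrix(rnorm(50 * 5), ncol = 5))

rf <- randomForest(x = trainX, y = trainY, xtest = newX,
                   ntree = 100L, keep.forest = FALSE)

# The trees themselves are not retained ...
forestDropped <- is.null(rf$forest)
# ... but the predictions for xtest are stored on the fitted object.
predictedLabels <- rf$test$predicted   # factor, one label per row of newX
testVotes       <- rf$test$votes       # one row per test row, one column per class

print(forestDropped)
print(table(predictedLabels))
```

According to the randomForest documentation, keep.forest already defaults to FALSE when xtest is supplied, so setting it explicitly mainly documents the intent.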

However, I'm now wondering whether this is the most memory-efficient way of getting predictions for toBePredicted. Another option is to grow the forest in parallel as several sub-forests, keep only the votes from each, and actively perform garbage collection:

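library(doParallel)
# Register a parallel backend with 5 worker processes
registerDoParallel(cores = 5)

# Each of the 5 iterations grows its own 100-tree sub-forest and returns only
# the test-set votes; cbind places the 5 vote matrices side by side.
subForestVotes <- foreach(subForestNumber = iter(seq.int(5)),
                          .combine = cbind,
                          .packages = "randomForest") %dopar% {
    rfObjWithPreds <- randomForest(x = myFeatures,
                                   y = myResponse,
                                   proximity = FALSE,
                                   localImp = FALSE,
                                   importance = FALSE,
                                   ntree = 100L,
                                   keep.forest = FALSE,
                                   keep.inbag = FALSE,
                                   xtest = toBePredicted)
    output <- rfObjWithPreds[["test"]][["votes"]]
    # Drop the fitted object and collect garbage before returning the votes
    rm(rfObjWithPreds)
    gc()
    output
}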
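The per-sub-forest votes still need to be turned into one label per row. One way to do that, sketched here under assumptions, is to add the vote matrices instead of binding them side by side, average them, and take the class with the largest share. The simulated data, the choice of 4 sub-forests of 25 trees, and the names avgVotes and nSubForests are hypothetical; the averaging assumes every sub-forest has the same number of trees.

```r
library(randomForest)
library(doParallel)

registerDoParallel(cores = 2)

set.seed(1)
# Hypothetical stand-in data (not the original data set).
trainX <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
trainY <- factor(ifelse(trainX$X1 + trainX$X2 > 0, "yes", "no"))
newX   <- data.frame(matrix(rnorm(50 * 5), ncol = 5))

nSubForests <- 4

# .combine = "+" adds the vote matrices element-wise, so the result stays
# n rows by nclass columns no matter how many sub-forests are grown.
summedVotes <- foreach(i = seq_len(nSubForests), .combine = "+",
                       .packages = "randomForest") %dopar% {
  rf <- randomForest(x = trainX, y = trainY, xtest = newX,
                     ntree = 25L, keep.forest = FALSE)
  # unclass() leaves a plain numeric matrix of vote fractions
  unclass(rf$test$votes)
}

# Average vote fraction per class over the equally sized sub-forests.
avgVotes <- summedVotes / nSubForests

# Pick the class with the largest average vote share for each row.
predictedLabels <- factor(
  colnames(avgVotes)[max.col(avgVotes, ties.method = "first")],
  levels = levels(trainY)
)

stopImplicitCluster()
print(table(predictedLabels))
```

Summing keeps the combined result the same size as a single vote matrix, whereas cbind grows it by one set of class columns per sub-forest.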

Does anyone have a smarter way of efficiently predicting toBePredicted?