TL;DR I want to know memory-efficient ways of performing batch prediction with randomForest models built on large datasets (hundreds of features, tens of thousands of rows).
Details:
I'm working with a large dataset (over 3 GB in memory) and want to do simple binary classification using randomForest. Since my data is proprietary, I cannot share it, but let's say the following code runs:
library(randomForest)
library(data.table)
myData <- fread("largeDataset.tsv")
myFeatures <- myData[, !c("response"), with = FALSE]
myResponse <- factor(myData[["response"]])  # factor response so randomForest does classification
toBePredicted <- fread("unlabeledData.tsv")
rfObj <- randomForest(x = myFeatures, y = myResponse, ntree = 100L)
predictedLabels <- predict(rfObj, toBePredicted)
However, it takes several GB of memory.
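For reference, here is a rough way to see where that memory goes (just base-R introspection; a sketch, not measurements from my actual data):
print(object.size(rfObj), units = "MB")              # size of the whole fitted object
sort(sapply(rfObj, object.size), decreasing = TRUE)  # rough per-component breakdown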
I know that I can save memory by turning off the proximity and importance calculations and the keep.* arguments (and by handing the unlabeled data to xtest, so predictions are made while the forest is grown):
rfObjWithPreds <- randomForest(x = myFeatures,
y = myResponse,
proximity = FALSE,
localImp = FALSE,
importance = FALSE,
ntree = 100L,
keep.forest = FALSE,
keep.inbag = FALSE,
xtest = toBePredicted)
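Since keep.forest = FALSE means the returned object can no longer be used with predict(), the predictions for toBePredicted come out of the test component instead (these element names are documented in ?randomForest):
predictedLabels <- rfObjWithPreds[["test"]][["predicted"]]  # predicted class labels for toBePredicted
voteFractions   <- rfObjWithPreds[["test"]][["votes"]]      # per-class vote fractions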
However, I'm now wondering whether this is the most memory-efficient way of getting predictions for toBePredicted. Another option would be to grow the forest in parallel and actively trigger garbage collection between sub-forests:
library(doParallel)  # also loads foreach and iterators
registerDoParallel(cores = 5)
subForestVotes <- foreach(subForestNumber = iter(seq.int(5)),
                          .combine = cbind,
                          .packages = "randomForest") %dopar% {
  rfObjWithPreds <- randomForest(x = myFeatures,
                                 y = myResponse,
                                 proximity = FALSE,
                                 localImp = FALSE,
                                 importance = FALSE,
                                 ntree = 100L,  # each worker grows its own 100-tree sub-forest
                                 keep.forest = FALSE,
                                 keep.inbag = FALSE,
                                 xtest = toBePredicted)
  output <- rfObjWithPreds[["test"]][["votes"]]
  rm(rfObjWithPreds)
  gc()     # free the sub-forest before the worker returns
  output
}
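Because each worker returns its own votes matrix (one column per class) and .combine = cbind just binds them side by side, the vote fractions still have to be aggregated into final labels. A rough sketch, assuming myResponse is a factor so its levels match the column names of the votes matrices:
# Average the vote fractions for each class across the 5 sub-forests and
# take the class with the highest average vote as the final prediction.
classLevels <- levels(myResponse)
avgVotes <- sapply(classLevels, function(cl) {
  rowMeans(subForestVotes[, colnames(subForestVotes) == cl, drop = FALSE])
})
predictedLabels <- factor(classLevels[max.col(avgVotes)], levels = classLevels)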
Does anyone have a smarter way of efficiently predicting toBePredicted?