Does random forest in R have a size limitation?

Posted 2020-05-24 06:39

Question:

I am training a randomForest model on training data that has 114954 rows and 135 columns (predictors), and I get the following error.

model <- randomForest(u_b_stars~. ,data=traindata,importance=TRUE,do.trace=100, keep.forest=TRUE, mtry=30)

Error: cannot allocate vector of size 877.0 Mb
In addition: Warning messages:
1: In randomForest.default(m, y, ...) :
The response has five or fewer unique values.  Are you sure you want to do regression?
2: In matrix(double(nrnodes * nt), ncol = nt) :
Reached total allocation of 3958Mb: see help(memory.size)
3: In matrix(double(nrnodes * nt), ncol = nt) :
Reached total allocation of 3958Mb: see help(memory.size)
4: In matrix(double(nrnodes * nt), ncol = nt) :
Reached total allocation of 3958Mb: see help(memory.size) 
5: In matrix(double(nrnodes * nt), ncol = nt) :
Reached total allocation of 3958Mb: see help(memory.size)

What can I do to avoid this error? Should I train on less data? That would not be good, of course. Can somebody suggest an alternative that does not require dropping rows from the training data? I want to use the complete training set.

Answer 1:

As was stated in an answer to a previous question (which I can't find now), increasing the sample size affects the memory requirements of RF in a nonlinear way. Not only is the model matrix larger, but the default size of each tree, based on the number of points per leaf, is also larger.
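As a rough illustration of that growth, the sketch below estimates the size of a single node matrix of the kind named in the warning (`matrix(double(nrnodes * nt), ncol = nt)`). The formula for `nrnodes` is an assumption about how the package sizes regression trees, not something stated in this thread; it is used here because it reproduces the 877.0 Mb figure from the error when combined with the default of 500 trees. The helper name `node_matrix_mb` is made up for this example.

```r
# Estimated size in Mb of one nrnodes x ntree matrix of doubles.
# ASSUMPTION: for regression, the node count per tree is sized as
#   nrnodes = 2 * trunc(n / max(1, nodesize - 4)) + 1
# This is inferred, not quoted from the thread.
node_matrix_mb <- function(n, nodesize = 5, ntree = 500) {
  nrnodes <- 2 * trunc(n / max(1, nodesize - 4)) + 1
  nrnodes * ntree * 8 / 1024^2   # 8 bytes per double
}

mb_default  <- node_matrix_mb(114954)                 # about 877 Mb, as in the error
mb_nodesize <- node_matrix_mb(114954, nodesize = 50)  # about 19 Mb
mb_ntree    <- node_matrix_mb(114954, ntree = 100)    # about 175 Mb

print(round(c(mb_default, mb_nodesize, mb_ntree), 1))
```

Under this assumption, the allocation scales with the number of rows and the number of trees, and shrinks sharply as nodesize grows. Several matrices of this shape are allocated, which is consistent with the repeated warnings in the question.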

To fit the model given your memory constraints, you can do the following:

  1. Increase the nodesize parameter to something bigger than the default, which is 5 for a regression RF. With 114k observations, you should be able to increase this significantly without hurting performance.

  2. Reduce the number of trees per RF with the ntree parameter. Fit several small RFs, then merge them with the combine function to produce the full forest.
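The two suggestions above can be sketched together as follows. The data here is a small synthetic stand-in (the real traindata has 114954 rows and 135 predictors), and the values nodesize = 50 and four forests of 25 trees are arbitrary choices for illustration, not tuned recommendations. The helper `fit_small` is a made-up name.

```r
library(randomForest)

# Synthetic stand-in for the asker's data (hypothetical).
set.seed(42)
n <- 2000
traindata <- data.frame(matrix(rnorm(n * 10), ncol = 10))
traindata$u_b_stars <- traindata$X1 + 0.5 * traindata$X2 + rnorm(n)

# Fit one small forest with a larger-than-default nodesize.
fit_small <- function(ntree) {
  randomForest(u_b_stars ~ ., data = traindata,
               ntree = ntree, nodesize = 50)
}

# Four forests of 25 trees each, fitted one after another.
forests <- lapply(rep(25, 4), fit_small)

# Merge them into a single forest of 100 trees.
model <- do.call(randomForest::combine, forests)
print(model$ntree)

# The combined object can be used for prediction as usual.
pred <- predict(model, traindata[1:5, ])
```

Note that the combined object does not carry the out-of-bag error summaries (mse and rsq for regression), so performance has to be checked on held-out data.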



Answer 2:

If you can't use a machine with more memory, one alternative is to train separate models on subsets of the data (say, 10 separate subsets) and then combine the outputs of those models in a sensible way. The easiest way is to average the predictions of the 10 models, but there are other ways to ensemble models: http://en.wikipedia.org/wiki/Ensemble_learning

Technically, you would be using all your data without hitting the memory restriction. Depending on the size of the resulting subsets, however, the individual models might be too weak to be of any use.
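A minimal sketch of the subset-and-average approach, again on synthetic stand-in data. The number of subsets (10) follows the suggestion above; ntree = 50, the random split, and the `newdata` rows are illustrative assumptions. Each model is discarded after predicting, so only one small forest is held in memory at a time.

```r
library(randomForest)

# Synthetic stand-in for the asker's data (hypothetical).
set.seed(1)
n <- 2000
traindata <- data.frame(matrix(rnorm(n * 10), ncol = 10))
traindata$u_b_stars <- traindata$X1 + rnorm(n)
newdata <- traindata[1:20, ]   # stand-in for the rows to be scored

# Randomly assign every row to one of k disjoint subsets.
k <- 10
subset_id <- sample(rep(seq_len(k), length.out = n))

# Train one forest per subset and keep only its predictions
# (one column per subset model).
preds <- sapply(seq_len(k), function(i) {
  fit <- randomForest(u_b_stars ~ ., data = traindata[subset_id == i, ],
                      ntree = 50)
  predict(fit, newdata)
})

# Ensemble by simple averaging across the k models.
avg_pred <- rowMeans(preds)
```

Comparing avg_pred against a held-out set is the simplest way to check whether the subset models are too weak, as cautioned above.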