How to build random forests in R with missing (NA) values

I would like to fit a random forest model, but when I call

library(randomForest)
cars$speed[1] <- NA # to simulate missing value
model <- randomForest(speed ~., data=cars)

I get the following error

Error in na.fail.default(list(speed = c(NA, 4, 7, 7, 8, 9, 10, 10, 10,  : 
   missing values in object

Tags: r machine-learning random-forest na missing-data

2 Answers

Root（大扎）

#2 · 2019-01-21 00:20

My initial reaction to this question was that it didn't show much research effort, since "everyone" knows that random forests don't handle missing values in predictors. But upon checking ?randomForest I must confess that it could be much more explicit about this.

(Although, Breiman's PDF linked to in the documentation does explicitly say that missing values are simply not handled at all.)

The only obvious clue in the official documentation that I could see was that the default value for the na.action parameter is na.fail, which might be too cryptic for new users.

In any case, if your predictors have missing values, you have (basically) two choices:

Use a different tool (rpart handles missing values nicely.)
Impute the missing values

Not surprisingly, the randomForest package has a function for doing just this, rfImpute. The documentation at ?rfImpute runs through a basic example of its use.
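As a rough sketch of the imputation route, the following closely follows the example in ?rfImpute. The iris data and the artificially deleted values are only for illustration. Note that rfImpute fills in missing predictors only; it will not accept NAs in the response, so rows with a missing response (as in the question's cars example) have to be dropped first.

```r
library(randomForest)

# Build a copy of iris with some predictor values knocked out at random.
iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

# Impute the missing predictor values using random forest proximities.
# The result is a data frame with the response in the first column.
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., data = iris.na)

# Fit the forest on the completed data.
set.seed(333)
iris.rf <- randomForest(Species ~ ., data = iris.imputed)
print(iris.rf)
```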

If only a small number of cases have missing values, you might also try setting na.action = na.omit to simply drop those cases.
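The two options that avoid imputation might look like the sketch below. The first part reuses the question's cars example with na.action = na.omit; the second fits an rpart tree on data with missing predictor values. The names cars_na and iris_na, and the positions of the NAs, are made up for illustration.

```r
library(randomForest)
library(rpart)

# Option A: drop incomplete rows (the question's example).
cars_na <- cars
cars_na$speed[1] <- NA  # simulate a missing value
set.seed(1)
model <- randomForest(speed ~ ., data = cars_na, na.action = na.omit)
print(model)

# Option B: use rpart, which keeps rows with missing predictor values.
iris_na <- iris
set.seed(2)
iris_na$Sepal.Width[sample(nrow(iris_na), 15)] <- NA
tree <- rpart(Species ~ ., data = iris_na)
print(tree)
```

Dropping rows is only reasonable when few cases are affected, since every incomplete row is discarded entirely.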

And of course, this answer is a bit of a guess that your problem really is simply having missing values.


霸刀☆藐视天下

#3 · 2019-01-21 00:38

If there is a possibility that the missing values are informative, you can impute them and also add binary indicator variables (new.vars <- is.na(your_dataset)), then check whether this lowers the error. If new.vars is too large a set to add to your_dataset in full, you could fit a model on new.vars alone, pick the important variables with varImpPlot, and add only those to your_dataset. You could also add a single variable that counts the number of NAs in each row: new.var <- rowSums(new.vars)

This is not an off-topic answer: if the missingness is informative, accounting for it can offset the increase in model error that comes from relying on an imperfect imputation procedure alone.

Missing values are informative when they arise from non-random causes, which is especially common in social-experiment settings.
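A minimal sketch of this idea, assuming the iris data with artificially deleted values (so the flags here carry no real signal and are not expected to help). The names dat, na_flags, na_count, and augmented are hypothetical, chosen only for this example.

```r
library(randomForest)

# Simulate missing predictor values (positions chosen at random).
dat <- iris
set.seed(42)
dat$Sepal.Width[sample(nrow(dat), 15)] <- NA
dat$Petal.Length[sample(nrow(dat), 10)] <- NA

# Binary missingness indicators for predictors that contain NAs.
predictors <- setdiff(names(dat), "Species")
na_flags <- is.na(dat[predictors])
na_flags <- na_flags[, colSums(na_flags) > 0, drop = FALSE]
colnames(na_flags) <- paste0(colnames(na_flags), "_missing")

# Single summary variable: number of NAs per row.
na_count <- rowSums(na_flags)

# Impute the predictors, then append the indicators and the count.
set.seed(43)
imputed <- rfImpute(Species ~ ., data = dat)
augmented <- cbind(imputed, na_flags * 1L, na_count = na_count)

# Compare out-of-bag error with and without the missingness variables.
set.seed(44)
fit_plain <- randomForest(Species ~ ., data = imputed)
set.seed(44)
fit_flags <- randomForest(Species ~ ., data = augmented)
print(fit_plain$err.rate[fit_plain$ntree, "OOB"])
print(fit_flags$err.rate[fit_flags$ntree, "OOB"])

# Inspect which variables, including the flags, the forest relies on.
varImpPlot(fit_flags)
```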

