Bizarre Behavior of randomForest Package When Drop

Posted 2019-04-13 04:11

Question:

I am running a random forest model that produces results that make absolutely no sense to me from a statistical perspective, and thus I'm convinced that something must be going wrong code-wise with the randomForest package.

The predicted / left hand side variable is, in at least this iteration of the model, a party ID with 3 possible outcomes: Democrat, Independent, Republican. I run the model, get results, fine. I'm at this point not super concerned with the results per se, but rather what happens when I make a small modification.

I then try to run it excluding Independents, and that's when things go awry in ways I find mystifying. Specifically, it loses almost all ability to predict anything, and labels almost all observations as belonging to the same class (Democrats).

"OK, fine, the information contained in the Independent observations was important to prediction." This was my first thought, although I couldn't for the life of me figure out why that would be true since the model is very bad at identifying independents.

Weirdly, a simple test of that hypothesis proved that it is not true. If I drop all but one observation with a party ID of Independent (sample size is a little over 4000 observations total, so 1 observation is a rounding error), the model performs fine. Thus, even when it is for all intents and purposes learning absolutely nothing from the "Independent" outcome category, the model runs as expected. It's only once the Independent category is actually removed that things go wrong.

Also, just to head off a potential suggestion: this does not appear to be a side effect of creating a new LHS variable. If I instead call droplevels() on party_id_3_cat after removing Independents and keep using that same LHS variable, I get the same results.
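For readers unfamiliar with the droplevels step mentioned here, the following is a minimal base-R sketch of what it does. The data frame `toy` and its columns are invented for illustration and are not the poster's data; the point is only that subsetting removes rows but keeps the empty factor level until droplevels() is called.

```r
# Toy data: a hypothetical stand-in for the poster's three_cat data frame.
toy <- data.frame(
  party = factor(c("Dem", "Ind", "Rep", "Dem", "Rep")),
  x     = factor(c("a", "b", "a", "b", "a"))
)

# subset() removes the "Ind" rows but keeps "Ind" as an (empty) factor level.
no_ind <- subset(toy, party != "Ind")
levels(no_ind$party)   # "Dem" "Ind" "Rep"
table(no_ind$party)    # "Ind" now has a count of 0

# droplevels() discards unused levels, leaving a true two-level factor.
no_ind$party <- droplevels(no_ind$party)
levels(no_ind$party)   # "Dem" "Rep"
```

Whether the outcome keeps an empty third level or not is exactly the difference between the "same LHS variable without droplevels" and "droplevels" variants described above.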

The results of the "no independents" version (called "two_cat" below) and "only one independent" version ("leave_one" below) should, as far as I can imagine, be nearly identical because they have nearly identical data. And yet the actual results produced by the two models are dramatically different. I've racked my brain and can't imagine why this would be true. Does anyone know anything about the randomForest package (or about random forest models in general, although that seems less likely) that could explain this behavior? If it's important, both the LHS and RHS variables are factor variables.
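The claim that the two data sets are nearly identical can be checked directly. The sketch below uses an invented toy data frame (the column names mirror the post, the values and the kept case ID "2" are hypothetical) to confirm that a filter of the form "not Independent, or one specific case" differs from the plain "not Independent" filter by exactly one row.

```r
# Toy stand-in for three_cat; column names mirror the post, values are invented.
toy <- data.frame(
  case_id        = as.character(1:6),
  party_id_3_cat = factor(c("Dem", "Ind", "Rep", "Ind", "Dem", "Rep"))
)

# Same two filters as in the post, applied to the toy data.
two_cat_toy   <- subset(toy, party_id_3_cat != "Ind")
leave_one_toy <- subset(toy, party_id_3_cat != "Ind" | case_id == "2")

# Case IDs present in leave_one_toy but not in two_cat_toy.
extra <- setdiff(leave_one_toy$case_id, two_cat_toy$case_id)
extra                                     # "2"
nrow(leave_one_toy) - nrow(two_cat_toy)   # 1
```

Running the same two comparisons on the real subsets would rule out an accidental difference in the rows themselves before looking for a cause inside the model.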

Thanks in advance!

Code:

library(randomForest)

load("three_cat.Rda")
two_cat <- subset(three_cat, party_id_3_cat != "2. Independents")
leave_one_in <- subset(three_cat, party_id_3_cat != "2. Independents" | case_id == "30")
two_cat$party_id_2_cat <- as.factor(ifelse(two_cat$party_id_3_cat == "1. Democrats (including leaners)", "Dem", "Rep"))

rf_three_cat <- randomForest(party_id_3_cat~[RHS VARS],
                             data=three_cat,
                             ntree=200,mtry=4,
                             type="classification",
                             importance=TRUE,confusion=TRUE)
rf_leave_one <- randomForest(party_id_3_cat~[RHS VARS],
                             data=leave_one_in,
                             ntree=200,mtry=4,
                             type="classification",
                             importance=TRUE,confusion=TRUE)
rf_two_cat   <- randomForest(party_id_2_cat~[RHS VARS],
                             data=two_cat,
                             ntree=200,mtry=4,
                             type="classification",
                             importance=TRUE,confusion=TRUE)

rf_three_cat$confusion
rf_leave_one$confusion
rf_two_cat$confusion

Results:

> rf_three_cat$confusion
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1150               3                                668   0.3684789
2. Independents                                                 296               4                                231   0.9924670
3. Republicans (including leaners)                              600               9                               1055   0.3773234

> rf_leave_one$confusion
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1080               0                                741   0.4069193
2. Independents                                                   0               0                                  1   1.0000000
3. Republicans (including leaners)                              517               0                               1097   0.3203222

> rf_two_cat$confusion
     Dem Rep class.error
Dem 1776  45   0.0247117
Rep 1581  33   0.9795539
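To make the collapse in the two-class model concrete, here is a small base-R sketch that recomputes the class.error column from the rf_two_cat counts printed above and also reports what share of all observations was assigned to each class. The counts are copied from the output; the variable names are my own, and the row/column reading (rows are observed classes, columns are predicted classes) is the usual layout of this matrix.

```r
# Two-class confusion matrix copied from the rf_two_cat output
# (rows = observed class, columns = predicted class).
conf <- matrix(c(1776, 45,
                 1581, 33),
               nrow = 2, byrow = TRUE,
               dimnames = list(observed  = c("Dem", "Rep"),
                               predicted = c("Dem", "Rep")))

# Per-class error: share of each observed class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Share of all observations assigned to each predicted class.
predicted_share <- colSums(conf) / sum(conf)

round(class_error, 4)
round(predicted_share, 4)
```

The recomputed errors match the printed class.error values, and the predicted shares show that roughly 98% of the 3,435 observations were labeled "Dem", which is the behavior described in the question.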