R - Random Forest and more than 53 categories

Posted 2019-03-04 22:53

I know that randomForest cannot handle a factor with more than 53 categories. Unfortunately, I have to analyze data in which one column has 165 levels, and I want to use a random forest for classification.

My problem is that I cannot remove this column, since it is known to be a valuable predictor.

This predictor has 165 levels and is a factor.

Are there any tips on how to handle this? Since we are talking about film genres, I have no idea.

Are there alternative packages for big data, or a special workaround? Something like that.

Switching to Python is not an option. We have too many R scripts here.

Thanks a lot and all the best

The str(data) looks like this:

'data.frame':   481696 obs. of  18 variables:
 $ SENDERNR          : int  432 1612 735 721 436 436 1321 721 721 434 ...
 $ SENDER            : Factor w/ 14 levels "ARD Das Erste",..: 6 3 4 9 12 12 10 9 9 7 ...
 $ GEPLANTE_SENDUNG_N: Factor w/ 12563 levels "-- nicht bekannt --",..: 7070 808 5579 9584 4922 4922 12492 1933 9584 4533 ...
 $ U_N_PROGRAMMCODE  : Factor w/ 14 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
 $ U_N_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
 $ U_N_SENDUNGSFORMAT: Factor w/ 29 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
 $ U_N_GENRE         : Factor w/ 163 levels "Action / Abenteuer",..: 119 147 115 4 158 158 163 61 4 84 ...
 $ U_N_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
 $ U_N_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 15 16 25 16 16 ...
 $ GEPLANTE_SENDUNG_V: Factor w/ 12191 levels "-- nicht bekannt --",..: 6932 800 5470 9382 1518 9318 12119 1829 9382 4432 ...
 $ U_V_PROGRAMMCODE  : Factor w/ 13 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
 $ U_V_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
 $ U_V_SENDUNGSFORMAT: Factor w/ 28 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
 $ U_V_GENRE         : Factor w/ 165 levels "Action / Abenteuer",..: 119 148 115 4 160 19 165 61 4 84 ...
 $ U_V_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
 $ U_V_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 9 16 25 16 16 ...
 $ ABGELEHNT         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ AKZEPTIERT        : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1 2 2 2 ...

2 answers
狗以群分
#2 · 2019-03-04 23:30

Having faced the same issue, here are some tips I can list.

  1. Switch to another algorithm, for instance gradient boosting from the gbm package, which can handle up to 1024 categorical levels. If the levels of your predictor discriminate well between the classes, you should also consider probabilistic approaches such as naiveBayes.
  2. Transform your predictor into dummy variables, which can be done with model.matrix. You can then fit a random forest on this matrix.
  3. Reduce the number of levels in your factor. That may sound like silly advice, but is it really relevant to look at a factor at such a fine granularity? Could you aggregate some levels into broader categories?
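
As a rough illustration of tip 3, the sketch below keeps the most frequent levels and merges the rest into a single level, so the factor fits under the 53-category limit mentioned in the question. The helper `lump_rare_levels`, the label "Other", and the toy genre data are all made up for this example; whether merging by frequency is sensible for film genres is a judgment call, and a hand-made mapping to broader genres may work better.

```r
# Hypothetical helper (not from any package): keep the n most frequent
# levels of a factor and merge all remaining levels into one "Other" level.
lump_rare_levels <- function(f, n = 52, other = "Other") {
  f <- droplevels(as.factor(f))
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  x <- as.character(f)
  x[!is.na(x) & !(x %in% keep)] <- other
  factor(x)
}

# Toy data: 165 made-up genre labels with very uneven frequencies
set.seed(1)
genre <- factor(sample(paste0("genre_", 1:165), 5000, replace = TRUE,
                       prob = 1 / (1:165)))

genre_small <- lump_rare_levels(genre, n = 52)
nlevels(genre_small)  # at most 53: the 52 kept levels plus "Other"
```

The same mapping has to be applied to any new data before prediction, so it is worth storing the vector of kept levels alongside the model.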

EDIT TO ADD MODEL.MATRIX EXAMPLE

As mentioned, here is an example of how to use model.matrix to transform your columns into dummy variables.

mydf <- data.frame(var1 = factor(c("A", "A", "A", "B", "B", "C")),
                   var2 = factor(c("X", "Y", "X", "Y", "X", "Z")),
                   target = c(1,1,1,2,2,2))
# set contrasts.arg so that every level of each factor gets its own column
dummyMat <- model.matrix(target ~ var1 + var2, mydf,
                         contrasts.arg = list(var1 = contrasts(mydf$var1, contrasts = FALSE),
                                              var2 = contrasts(mydf$var2, contrasts = FALSE)))
mydf2 <- cbind(mydf, dummyMat[, 2:ncol(dummyMat)]) # drop the intercept column
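
To round this off, here is a minimal sketch of fitting a forest on such a dummy matrix, reusing the toy data from the example above. It assumes the randomForest package is installed (the fit is skipped otherwise) and converts the target to a factor so that a classification forest is grown; `ntree = 50` is an arbitrary choice for the toy data.

```r
# Same toy data as in the example above
mydf <- data.frame(var1 = factor(c("A", "A", "A", "B", "B", "C")),
                   var2 = factor(c("X", "Y", "X", "Y", "X", "Z")),
                   target = factor(c(1, 1, 1, 2, 2, 2)))  # factor => classification

# One 0/1 column per level of each factor
x <- model.matrix(~ var1 + var2, mydf,
                  contrasts.arg = list(var1 = contrasts(mydf$var1, contrasts = FALSE),
                                       var2 = contrasts(mydf$var2, contrasts = FALSE)))
x <- x[, -1, drop = FALSE]  # drop the intercept column

# Assumption: the randomForest package is installed; skipped otherwise
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(x = x, y = mydf$target, ntree = 50)
  print(fit)
}
```

With a 165-level factor this adds 165 columns, and each split can only separate one genre from all the others, so the forest may need more trees than it would with a native factor.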
The star\"
#3 · 2019-03-04 23:54

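Use the caret package:

library(caret)

# target and training_data are placeholders: substitute the name of your
# dependent variable and your training data set
random_forest <- train(target ~ .,
                       data = training_data,
                       method = "ranger")

This can handle more than 53 categories.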
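
For comparison, the ranger package can also be called directly, without caret. The sketch below is only an illustration on made-up data: the data frame `toy` and its columns are invented, and it assumes ranger is installed (the fit is skipped otherwise). Setting `respect.unordered.factors = "order"` asks ranger to order the factor levels by the response before splitting, which is one way it copes with many-level factors.

```r
# Made-up data: a 165-level factor, one numeric predictor, a binary target
set.seed(42)
n <- 2000
toy <- data.frame(
  genre = factor(sample(paste0("genre_", 1:165), n, replace = TRUE)),
  x1 = rnorm(n)
)
# Acceptance probability depends on the genre code (arbitrary rule)
p_accept <- ifelse(as.integer(toy$genre) %% 2 == 0, 0.8, 0.2)
toy$accepted <- factor(rbinom(n, 1, p_accept))

# Assumption: the ranger package is installed; skipped otherwise
if (requireNamespace("ranger", quietly = TRUE)) {
  fit <- ranger::ranger(accepted ~ ., data = toy, num.trees = 100,
                        respect.unordered.factors = "order")
  print(fit$prediction.error)  # out-of-bag misclassification rate
}
```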