
Using ordinal variables in rpart and caret without

Posted 2019-04-10 16:36

Question:

I am trying to create an ordinal regression tree in R using rpart. Most of the predictors are ordinal data, stored as factors in R.

When I create the tree using rpart, I get something like this:

where the values are the factor values (e.g., A170 has labels ranging from -5 to 10).

However, when I use caret to train the model using rpart and then extract the final model, the tree no longer has ordinal predictors. See below for a sample output tree:

As you can see above, the ordinal variable A170 now seems to have been converted into multiple dummy variables; e.g., A17010 in the second tree is a dummy for A170 taking the value 10.

So, is it possible to retain ordinal variables instead of converting factor variables into multiple binary indicator variables when fitting trees with the caret package?

Answer 1:

Let's start with a reproducible example:

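set.seed(144)
# x: a factor with six levels; y: a 0/1 outcome whose rate depends on which pair of levels x falls in
dat <- data.frame(x=factor(sample(1:6, 10000, replace=TRUE)))
dat$y <- ifelse(dat$x %in% 1:2, runif(10000) < 0.1,
         ifelse(dat$x %in% 3:4, runif(10000) < 0.4, runif(10000) < 0.7))*1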
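
As a quick sanity check on this example, the sketch below rebuilds the same data and tabulates the mean of y within each level of x. The name `rates` is introduced here for illustration only. If the simulation behaves as intended, levels 1-2, 3-4 and 5-6 should show rates near 0.1, 0.4 and 0.7, which is the grouping a tree would be expected to recover.

```r
set.seed(144)
dat <- data.frame(x = factor(sample(1:6, 10000, replace = TRUE)))
dat$y <- ifelse(dat$x %in% 1:2, runif(10000) < 0.1,
         ifelse(dat$x %in% 3:4, runif(10000) < 0.4, runif(10000) < 0.7)) * 1

# Mean of y within each factor level (hypothetical helper name: rates)
rates <- tapply(dat$y, dat$x, mean)
print(round(rates, 2))
```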

As you note, training with the rpart function groups the factor levels together:

library(rpart)
rpart(y~x, data=dat)

I was able to reproduce caret splitting the factor into its individual levels by using the formula interface to the train function:

library(caret)
train(y~x, data=dat, method="rpart")$finalModel
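
A plausible explanation, stated here as an assumption rather than something confirmed in this thread, is that the formula interface expands factors into indicator columns in the same way base R's model.matrix does. The sketch below uses only base R and a small made-up data frame (`toy` and `mm` are illustrative names) to show the column names such an expansion produces. They follow the "variable name plus level" pattern, which matches names like A17010 in the question.

```r
# Small deterministic data frame with a six-level factor (illustrative only)
toy <- data.frame(x = factor(rep(1:6, times = 5)))
toy$y <- seq_len(nrow(toy))

# Expand the formula into a design matrix, as base R modelling functions do
mm <- model.matrix(y ~ x, data = toy)

# Each non-reference level becomes its own 0/1 column named <variable><level>
print(colnames(mm))
```

Once the factor has been replaced by separate 0/1 columns, a tree can only split on one level at a time, so it can no longer group several levels in a single split.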

The solution I found to avoid splitting factors by level is to pass the predictors to the train function as a data frame (the x/y interface) instead of using the formula interface:

train(x=data.frame(dat$x), y=dat$y, method="rpart")$finalModel
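
One small side effect of the call above is that data.frame(dat$x) names its column dat.x, so that name is what appears in the printed tree. A minimal sketch of an alternative, assuming the same dat as in the example, is to subset with dat["x"], which keeps the original column name. The train call is left as a comment so that the snippet runs without caret installed.

```r
set.seed(144)
dat <- data.frame(x = factor(sample(1:6, 10000, replace = TRUE)))

# Column name produced by the call used in the answer
print(names(data.frame(dat$x)))

# Single-bracket subsetting returns a one-column data frame and keeps the name
print(names(dat["x"]))

# Possible usage with caret (not run here):
# train(x = dat["x"], y = dat$y, method = "rpart")$finalModel
```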



Tags: r r-caret rpart