R - factor examcard has new levels

Posted 2019-08-01 14:30

Question:

I built a classification model in R using C5.0 given below:

library(C50)
library(caret)
a = read.csv("All_SRN.csv")
set.seed(123)
inTrain <- createDataPartition(a$anatomy, p = .70, list = FALSE)
training <- a[ inTrain,]
test <- a[-inTrain,]
Tree <- C5.0(anatomy ~ ., data = training, 
            trControl = trainControl(method = "repeatedcv", repeats = 10,
                                     classProb = TRUE))
TreePred <- predict(Tree, test)

The data set has the features examcard, coil_used, anatomy_region and bodypart_anatomy, plus the target class anatomy. All of the features are categorical. There are a little over 10k observations in total, which I split into training and test sets in a 70:30 ratio. The model worked well on that split, but the problem appears when I predict on a test set that contains new values:

TreePred <- predict(Tree, test_add)

Here, test_add contains the existing test set plus a set of new values. On execution, the model fails to classify the new values and throws the following error:

Error in model.frame.default(object$Terms, newdata, na.action = na.action, : factor examcard has new levels

I tried to merge the new factor levels with the existing one using:

Tree$xlevels[["examcard"]] <- union(Tree$xlevels[["examcard"]], levels(test_add$examcard))

This did not help: the call ran but produced no usable result, only the following message:

predict code called exit with value 1

The feature examcard is highly important for the classification, so it can't be dropped. How can this set of values be classified?

Answer 1:

You cannot create a prediction for factor levels in your test set that are absent from your training set. The model has learned nothing for these new levels: no coefficients in a regression-style model, and no branches in a tree such as C5.0.
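To see which rows trigger the error, the levels in the new data can be compared with the levels seen in training. The following is a minimal sketch in base R. The two data frames are invented toy stand-ins for the asker's training and test_add, and the names unseen, scorable and held_out are hypothetical.

```r
# Toy stand-ins for the asker's data (values are invented for illustration)
training <- data.frame(
  examcard = factor(c("A", "B", "A", "C")),
  anatomy  = factor(c("head", "knee", "head", "spine"))
)
test_add <- data.frame(
  examcard = factor(c("A", "D", "C", "E"))
)

# Levels that occur in the new data but were never seen during training
unseen <- setdiff(unique(as.character(test_add$examcard)),
                  levels(training$examcard))

# Separate rows the model can score from rows it cannot
can_score <- !(as.character(test_add$examcard) %in% unseen)
scorable  <- test_add[can_score, , drop = FALSE]
held_out  <- test_add[!can_score, , drop = FALSE]

# Re-encode with the training levels so predict() sees a matching factor
scorable$examcard <- factor(as.character(scorable$examcard),
                            levels = levels(training$examcard))
```

Only scorable can be passed to predict() without the "new levels" error. The rows in held_out need either a model retrained on labelled data that includes those examcard values or some fallback rule; which of the two is appropriate depends on the data.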

If you are doing a 70/30 split, you need to repartition your data using caret::createDataPartition...

... or your own stratified sample function to ensure that all levels are represented in the training set. Use the "split-apply-combine" approach: split the data set by examcard, apply the 70/30 split to each subset, then combine the training subsets and the testing subsets.
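The split-apply-combine idea can be sketched in base R as follows. The helper stratified_split is hypothetical (it is not part of caret or C50), and the toy data frame is invented; the real columns would come from All_SRN.csv. The sketch assumes that a level with only one row should go to the training set.

```r
# Hypothetical helper: split row indices by a factor, sample within each
# group, then recombine into one training set and one test set.
stratified_split <- function(df, by, p = 0.7) {
  groups <- split(seq_len(nrow(df)), df[[by]], drop = TRUE)    # split
  train_idx <- unlist(lapply(groups, function(idx) {           # apply
    n_train <- max(1, floor(length(idx) * p))  # at least one row per level
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(training = df[train_idx, , drop = FALSE],               # combine
       test     = df[-train_idx, , drop = FALSE])
}

set.seed(123)
# Invented toy data with unbalanced examcard levels
a <- data.frame(
  examcard = factor(rep(c("A", "B", "C", "D"), times = c(10, 5, 2, 1))),
  anatomy  = factor(sample(c("head", "knee"), 18, replace = TRUE))
)
parts <- stratified_split(a, "examcard", p = 0.7)

# Check: every examcard level in the test set also appears in training
all(unique(parts$test$examcard) %in% unique(parts$training$examcard))
```

Note that createDataPartition stratifies on whatever vector it is given, so passing a$examcard instead of a$anatomy is the caret route to the same goal. Neither approach helps with examcard values that appear only in later data such as test_add; those need labelled examples added to the training data.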

See this question for more details.