R - factor examcard has new levels

Posted 2019-08-01 14:32

I built a classification model in R using C5.0, as shown below:

library(C50)
library(caret)
a = read.csv("All_SRN.csv")
set.seed(123)
inTrain <- createDataPartition(a$anatomy, p = .70, list = FALSE)
training <- a[ inTrain,]
test <- a[-inTrain,]
Tree <- C5.0(anatomy ~ ., data = training, 
            trControl = trainControl(method = "repeatedcv", repeats = 10,
                                     classProb = TRUE))
TreePred <- predict(Tree, test)

The training set has the features examcard, coil_used, anatomy_region and bodypart_anatomy, plus anatomy (the target class). All of the features are categorical variables. There are just over 10k values in total, which I divided into training and test data. The learner worked well with the training and test sets partitioned in a 70:30 ratio, but the problem comes when I give it a test set that contains new values:

TreePred <- predict(Tree, test_add)

Here, test_add contains the existing test set plus a set of new values. On execution, the learner fails to classify the new values and throws the following error:

Error in model.frame.default(object$Terms, newdata, na.action = na.action, : factor examcard has new levels

I tried to merge the new factor levels with the existing ones using:

Tree$xlevels[["examcard"]] <- union(Tree$xlevels[["examcard"]], levels(test_add$examcard))

This did not help: the code ran, but it ended with the following message and produced no useful result:

predict code called exit with value 1

The feature examcard is very important to the classification, so it can't be ignored. How can this set of values be classified?

1 answer
你好瞎i
#2 · 2019-08-01 15:25

You cannot create a prediction for factor levels in your test set that are absent from your training set. Your model has no coefficients (or, for a tree such as C5.0, no split rules) for these new factor levels.
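As a rough illustration of the point above, the unseen levels can be found before calling predict. This is only a sketch in base R: the two small data frames are made-up stand-ins for the asker's training and test_add, and new_levels, unseen and scorable are hypothetical names.

```r
# Toy stand-ins for the asker's data frames (values are invented).
training <- data.frame(examcard = factor(c("A", "B", "A", "C")))
test_add <- data.frame(examcard = factor(c("A", "D", "B", "E")))

# Levels that occur in the new data but were never seen in training.
new_levels <- setdiff(levels(test_add$examcard), levels(training$examcard))

# Flag the rows the model cannot score, and keep the rows it can.
unseen   <- test_add$examcard %in% new_levels
scorable <- droplevels(test_add[!unseen, , drop = FALSE])

print(new_levels)
print(nrow(scorable))
```

Rows flagged in unseen would have to be handled separately, for example by retraining once labelled examples of those examcard values are available.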

If you are doing a 70/30 split, you need to repartition your data using caret::createDataPartition...

... or your own stratified sample function to ensure that all levels are represented in the training set: use the "split-apply-combine" approach: split the data set by examcard, and for each subset, apply the split, then combine the training subsets and the testing subsets.
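The split-apply-combine idea described above could look roughly like the following in base R. It is a minimal sketch on invented toy data, not the answerer's own code: the data frame a only mimics the column names from the question, and idx_by_level and train_idx are hypothetical names. It assumes every examcard level should contribute at least one row to the training set.

```r
set.seed(123)

# Toy data mimicking the question's columns (values are invented).
a <- data.frame(
  examcard = factor(rep(c("A", "B", "C"), times = c(10, 5, 1))),
  anatomy  = factor(rep(c("head", "knee"), length.out = 16))
)

# Split: row indices grouped by examcard level (empty levels dropped).
idx_by_level <- split(seq_len(nrow(a)), droplevels(a$examcard))

# Apply: draw about 70% of each group, but never fewer than one row.
# Indexing with sample.int() avoids the sample(x, n) pitfall when a
# group holds a single row.
train_idx <- unlist(lapply(idx_by_level, function(idx) {
  n_train <- max(1, floor(0.7 * length(idx)))
  idx[sample.int(length(idx), n_train)]
}), use.names = FALSE)

# Combine: training rows from all groups, the rest go to the test set.
training <- a[train_idx, ]
test     <- a[-train_idx, ]
```

With this kind of split, a level that has only one row ends up in the training set alone, so it is never used for evaluation.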

See this question for more details.
