I built a classification model in R using C5.0, given below:
library(C50)
library(caret)

# read the data; stringsAsFactors keeps the categorical columns as factors
a <- read.csv("All_SRN.csv", stringsAsFactors = TRUE)

set.seed(123)
inTrain  <- createDataPartition(a$anatomy, p = .70, list = FALSE)
training <- a[ inTrain, ]
test     <- a[-inTrain, ]

# note: trControl/trainControl() are caret::train() arguments; C5.0() itself ignores them
Tree <- C5.0(anatomy ~ ., data = training,
             trControl = trainControl(method = "repeatedcv", repeats = 10,
                                      classProbs = TRUE))
TreePred <- predict(Tree, test)
The training set has features such as examcard, coil_used, anatomy_region, bodypart_anatomy, and anatomy (the target class). All of the features are categorical variables. There are roughly 10,000 records in total, which I divided into training and test sets in a 70:30 ratio. The learner worked well with this partition, but the problem arises when I supply a test set containing new values, as shown below:
TreePred <- predict(Tree, test_add)
Here, test_add contains the existing test set plus a set of new values. On execution, the learner fails to classify the new values and throws the following error:
Error in model.frame.default(object$Terms, newdata, na.action = na.action, : factor examcard has new levels
I tried to merge the new factor levels with the existing ones using:
Tree$xlevels[["examcard"]] <- union(Tree$xlevels[["examcard"]], levels(test_add$examcard))
But this wasn't of much help: the prediction exited with the following message and didn't yield a usable result:
predict code called exit with value 1
The feature examcard carries a great deal of weight in the classification, so it can't simply be dropped. How can this set of values be classified?
You cannot produce predictions for factor levels that appear in your test set but are absent from your training set: the fitted tree has learned no rules or splits for those new levels.
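As a quick check (a minimal sketch reusing the object names from the question), you can list the examcard values that occur in test_add but were never seen during training:

# examcard values present in test_add but absent from the training data
new_levels <- setdiff(unique(as.character(test_add$examcard)),
                      unique(as.character(training$examcard)))
new_levels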
If you are doing a 70/30 split, you need to repartition your data using caret::createDataPartition or your own stratified sampling function so that every examcard level is represented in the training set. One way is the "split-apply-combine" approach: split the data set by examcard, apply the 70/30 split within each subset, then combine the training subsets and the test subsets, as sketched below.
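A minimal sketch of that approach in base R, assuming the data frame a and the examcard column from the question (each level keeps at least one row in the training set so the model sees every level):

set.seed(123)

# split the data by examcard level
by_exam <- split(a, a$examcard)

# apply a 70/30 split within each level, keeping at least one row per level in training
parts <- lapply(by_exam, function(d) {
  n_train <- max(1, floor(0.7 * nrow(d)))
  idx <- sample(nrow(d), size = n_train)
  list(train = d[idx, ], test = d[-idx, ])
})

# combine the per-level pieces back into one training set and one test set
training <- do.call(rbind, lapply(parts, `[[`, "train"))
test     <- do.call(rbind, lapply(parts, `[[`, "test"))

# every examcard level now appears in the training set
stopifnot(all(unique(as.character(a$examcard)) %in%
              unique(as.character(training$examcard))))

Note that this only guarantees coverage for levels already present in your data; a genuinely new examcard arriving later would still trigger the same error.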
See this question for more details.