I built a classification model in R using C5.0, given below:
library(C50)
library(caret)

# read the data; stringsAsFactors keeps the categorical columns as factors
a <- read.csv("All_SRN.csv", stringsAsFactors = TRUE)

set.seed(123)
inTrain  <- createDataPartition(a$anatomy, p = .70, list = FALSE)
training <- a[ inTrain, ]
test     <- a[-inTrain, ]

# note: trControl/trainControl() are caret::train() arguments; C5.0() itself ignores them
Tree <- C5.0(anatomy ~ ., data = training,
             trControl = trainControl(method = "repeatedcv", repeats = 10,
                                      classProbs = TRUE))
TreePred <- predict(Tree, test)
The training set has features such as examcard, coil_used, anatomy_region, bodypart_anatomy, and anatomy (the target class). All of the features are categorical variables. There are roughly 10,000 records in total, which I divided into training and test sets in a 70:30 ratio. The learner worked well with this partition, but the problem arises when I supply a test set containing new values, as shown below:
TreePred <- predict(Tree, test_add)
Here, test_add contains the existing test set plus a set of new values. On execution, the learner fails to classify the new values and throws the following error:
Error in model.frame.default(object$Terms, newdata, na.action = na.action, : factor examcard has new levels
I tried to merge the new factor levels with the existing ones using:
Tree$xlevels[["examcard"]] <- union(Tree$xlevels[["examcard"]], levels(test_add$examcard))
But this wasn't of much help: the prediction exited with the following message and didn't yield a usable result:
predict code called exit with value 1
The feature examcard carries a great deal of weight in the classification, so it can't simply be dropped. How can this set of values be classified?
You cannot produce predictions for factor levels that appear in your test set but are absent from your training set: the fitted tree has learned no rules or splits for those new levels.
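As a quick check (a minimal sketch reusing the object names from the question), you can list the examcard values that occur in test_add but were never seen during training:

# examcard values present in test_add but absent from the training data
new_levels <- setdiff(unique(as.character(test_add$examcard)),
                      unique(as.character(training$examcard)))
new_levels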
If you are doing a 70/30 split, you need to repartition your data using caret::createDataPartition or your own stratified sampling function so that every examcard level is represented in the training set. One way is the "split-apply-combine" approach: split the data set by examcard, apply the 70/30 split within each subset, then combine the training subsets and the test subsets, as sketched below.
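A minimal sketch of that approach in base R, assuming the data frame a and the examcard column from the question (each level keeps at least one row in the training set so the model sees every level):

set.seed(123)

# split the data by examcard level
by_exam <- split(a, a$examcard)

# apply a 70/30 split within each level, keeping at least one row per level in training
parts <- lapply(by_exam, function(d) {
  n_train <- max(1, floor(0.7 * nrow(d)))
  idx <- sample(nrow(d), size = n_train)
  list(train = d[idx, ], test = d[-idx, ])
})

# combine the per-level pieces back into one training set and one test set
training <- do.call(rbind, lapply(parts, `[[`, "train"))
test     <- do.call(rbind, lapply(parts, `[[`, "test"))

# every examcard level now appears in the training set
stopifnot(all(unique(as.character(a$examcard)) %in%
              unique(as.character(training$examcard))))

Note that this only guarantees coverage for levels already present in your data; a genuinely new examcard arriving later would still trigger the same error.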
See this question for more details.