How to handle errors in predict function of R?

Posted 2019-06-22 16:12

I have a data frame df, and I am building a machine learning model (a C5.0 decision tree) to predict the class of a column (loan_approved):

Structure (not real data):

id occupation income  loan_approved
1  business   4214214 yes
2  business   32134   yes
3  business   43255   no
4  sailor     5642    yes
5  teacher    53335   no
6  teacher    6342    no

Process:

  • I randomly split the data frame into train and test sets and trained on the train set (rows 1, 2, 3, 5, 6 as train and row 4 as test)
  • To account for new categorical levels in one or more columns, I used a tryCatch-based function

Function:

    # Predict the class for the given test rows; if predict() errors
    # (e.g. because of an unseen factor level), fall back to "no"
    error_free_predict = function(x) {
      output = tryCatch({
        predict(C50_model, newdata = test[x, ], type = "class")
      }, error = function(e) {
        "no"
      })
      return(output)
    }

Then I applied the prediction function:

test <- mutate(test, predicted_class = error_free_predict(1:NROW(test)))

Problem:

id occupation income loan_approved predicted_class
1  business   4214214 yes          no
2  business   32134   yes          no
3  business   43255   no           no
4  sailor     5642    yes          no
5  teacher    53335   no           no
6  teacher    6342    no           no

Question:

I know this is because the test data frame had a new level that was not present in the train data, but shouldn't my function work for all cases except this one?

P.S.: I did not use sapply because it was too slow

2 Answers
干净又极端
#2 · 2019-06-22 16:35

I generally handle this with a loop that recodes any level not present in train as NA. Here train is the data you used to train the model and test is the data you want to predict on.

# Re-level every factor column in test using the levels seen in train;
# levels not present in train become NA
for(i in 1:ncol(train)){
  if(is.factor(train[,i])){
    test[,i] <- factor(test[,i], levels = levels(train[,i]))
  }
}

tryCatch is an error-handling mechanism, i.e. it only kicks in after an error has already occurred. It is appropriate when you want to do something different once the error is encountered. If you still want the model to score those rows, this loop takes care of the new levels instead.
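For completeness, a minimal sketch of using this before predicting, with an optional default for rows whose level was unseen (C50_model and the occupation column are taken from the question; the "no" fallback is an assumption, not something the model requires):

    # after the re-levelling loop above, unseen occupations in test are NA
    test$predicted_class <- as.character(predict(C50_model, newdata = test, type = "class"))
    # optional: force the default "no" for rows whose occupation was not seen in training
    test$predicted_class[is.na(test$occupation)] <- "no"

Depending on how the model handles NA predictors, the explicit "no" override may or may not change anything; it simply mirrors the fallback behaviour of the tryCatch approach.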

beautiful°
#3 · 2019-06-22 16:49

There are two parts to this problem.

  1. The first part of the problem arises while training the model, because categorical levels are not evenly divided between train and test if you split randomly. In your case, say you have only one record with occupation "sailor"; it is possible that it ends up in the test set after a random split. The model built on the train set has then never seen the occupation "sailor" and hence throws an error. More generally, some level of another categorical variable could end up entirely in the test set after random splitting.

So instead of dividing the data randomly between train and test, you can do stratified sampling. Code using data.table for a 70:30 split:

library(data.table)
# sample ~30% of the row indices within each occupation group for the test set
ind <- total_data[, sample(.I, round(0.3*.N), FALSE), by="occupation"]$V1
train <- total_data[-ind,]
test <- total_data[ind,]

This makes sure every level is split proportionally between the train and test sets, so you will not get a "new" categorical level in the test set, which can happen with a purely random split.
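A quick sanity check (just a sketch, using the occupation column from the question) to confirm that no level appears only in the test set:

    # every occupation present in test should also be present in train
    all(unique(test$occupation) %in% unique(train$occupation))  # should be TRUE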

  2. The second part of the problem comes when the model is in production and encounters an altogether new level that was not in the training or test set. To tackle this, you can maintain a list of all levels of all categorical variables, e.g. lvl_cat_var1 <- unique(cat_var1), lvl_cat_var2 <- unique(cat_var2), etc. Then, before predicting, you can check for new levels and filter:

    new_lvl_data <- total_data[!(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)] 
    pred_data <- total_data[(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)] 
    

then for the default prediction do:

new_lvl_data$predicted_class <- "no" 

and run the full model prediction on pred_data.
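Putting both pieces together, a minimal sketch (the variable names var1/var2 and the "no" default are assumptions carried over from above, and C50_model is the model from the question):

    # score rows whose levels were all seen during training,
    # give the default prediction to the rest, then combine
    pred_data$predicted_class <- as.character(predict(C50_model, newdata = pred_data, type = "class"))
    new_lvl_data$predicted_class <- "no"
    scored_data <- rbind(pred_data, new_lvl_data)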
