-->

Warning message: “missing values in resampled perf

Posted 2020-01-31 00:53

Question:

I am using the caret package to train a model with the "rpart" method:

tr = train(y ~ ., data = trainingDATA, method = "rpart")

The data has no missing values (no NAs), but running the command produces this warning:

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Does anyone know what this warning means, or could you point me to where I can find an answer? I understand that it says there were missing values in the resampled performance measures, but what exactly does that mean, and how can such a situation arise? By the way, the predict() function works fine with the fitted model, so I am asking out of curiosity.

Answer 1:

It is hard to say definitively without more data.

If this is regression, the most likely cause is that the tree did not find a good split and used the mean of the outcome as its prediction for every sample. That is fine, but R^2 cannot then be calculated because the variance of the predictions is zero.
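A minimal base-R sketch of the regression case. It assumes that Rsquared is computed as the squared correlation between predictions and observed values (which, as far as I know, is what caret does), and the numbers are made up for illustration:

```r
# Hypothetical held-out outcomes for one resample
obs <- c(3.1, 4.7, 5.2, 6.0, 7.4)

# A tree with no splits predicts the same value (the mean) for every sample
pred <- rep(mean(obs), length(obs))

# Correlation with a constant vector is undefined, so R^2 becomes NA
r2 <- suppressWarnings(cor(pred, obs))^2

# RMSE is still well defined for constant predictions
rmse <- sqrt(mean((pred - obs)^2))

print(c(Rsquared = r2, RMSE = rmse))
```

If a resample like this occurs, its Rsquared is missing while its RMSE is not, which would be enough to trigger the warning.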

If this is classification, it is harder to say. You could have a resample in which one of the outcome classes has zero samples, so sensitivity or specificity is undefined and therefore NA.
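The classification case can be sketched the same way in base R. The labels below are invented, and the hand-computed sensitivity and specificity are only meant to show where the division by zero comes from, not to reproduce caret's own functions:

```r
# Hypothetical held-out fold in which the "yes" class never occurs
obs  <- factor(c("no", "no", "no", "no"),  levels = c("yes", "no"))
pred <- factor(c("no", "yes", "no", "no"), levels = c("yes", "no"))

tab <- table(pred, obs)

# Sensitivity = true positives / all actual positives -> 0 / 0
sens <- tab["yes", "yes"] / sum(tab[, "yes"])

# Specificity = true negatives / all actual negatives -> 3 / 4
spec <- tab["no", "no"] / sum(tab[, "no"])

print(c(sensitivity = sens, specificity = spec))
```

With no "yes" samples in the fold, the sensitivity is 0/0 and therefore undefined.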



Answer 2:

The Problem

The problem may be a factor with too many levels. Some tree-based implementations can only handle a limited number of levels in a categorical predictor. The error below is raised by randomForest (not by rpart itself) when a variable has been set to a factor with more than 53 categories:

> rf.1 <- randomForest(x = rf.train.2, 
+                      y = rf.label, 
+                      ntree = 1000)
Error in randomForest.default(x = rf.train.2, y = rf.label, ntree = 1000) : 
Can not handle categorical predictors with more than 53 categories.

If caret ends up calling that function underneath, the same limit applies, so make sure you fix any categorical variables with more than 53 levels.

Here is where my problem lay before (notice zipcode coming in as a factor):

# ------------------------------- #
# RANDOM FOREST WITH CV 10 FOLDS  #
# ------------------------------- #
rf.train.2 <- df_train[, c("v1",
                      "v2",
                      "v3",
                      "v4",
                      "v5",
                      "v6",
                      "v7",
                      "v8",
                      "zipcode",
                      "price",
                      "made_purchase")]
rf.train.2 <- data.frame(v1=as.factor(rf.train.2$v1),
                     v2=as.factor(rf.train.2$v2),
                     v3=as.factor(rf.train.2$v3),
                     v4=as.factor(rf.train.2$v4),
                     v5=as.factor(rf.train.2$v5),
                     v6=as.factor(rf.train.2$v6),
                     v7=as.factor(rf.train.2$v7),
                     v8=as.factor(rf.train.2$v8),
                     zipcode=as.factor(rf.train.2$zipcode),
                     price=rf.train.2$price,
                     made_purchase=as.factor(rf.train.2$made_purchase))
rf.label <- rf.train.2[,"made_purchase"]

The Solution

Remove all categorical variables that have more than 53 levels, or stop treating them as factors.
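To find the offending columns before fitting, a short base-R check along these lines may help. The data frame `toy` and its columns are hypothetical stand-ins, and the threshold of 53 is taken from the randomForest error message above:

```r
# Hypothetical data: zipcode has 60 distinct levels, v1 has 2
toy <- data.frame(
  zipcode = factor(sprintf("%05d", 1:60)),
  v1      = factor(rep(c("a", "b"), 30)),
  price   = seq(100, by = 5, length.out = 60)
)

# Number of levels per factor column (NA for non-factor columns)
n_levels <- sapply(toy, function(col) {
  if (is.factor(col)) nlevels(col) else NA_integer_
})

# Names of factor columns that exceed the 53-level limit
too_many <- names(n_levels)[!is.na(n_levels) & n_levels > 53]
print(too_many)

# One option: drop those columns before training
toy_fixed <- toy[, setdiff(names(toy), too_many), drop = FALSE]
```

Dropping the column is only one choice. Grouping rare levels or using a numeric encoding are alternatives, depending on what the variable means.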

Here is my fixed code, which leaves zipcode as it is instead of converting it to a factor. You could also wrap it in a numeric conversion, like this: as.numeric(rf.train.2$zipcode).

# ------------------------------- #
# RANDOM FOREST WITH CV 10 FOLDS  #
# ------------------------------- #
rf.train.2 <- df_train[, c("v1",
                      "v2",
                      "v3",
                      "v4",
                      "v5",
                      "v6",
                      "v7",
                      "v8",
                      "zipcode",
                      "price",
                      "made_purchase")]
rf.train.2 <- data.frame(v1=as.factor(rf.train.2$v1),
                     v2=as.factor(rf.train.2$v2),
                     v3=as.factor(rf.train.2$v3),
                     v4=as.factor(rf.train.2$v4),
                     v5=as.factor(rf.train.2$v5),
                     v6=as.factor(rf.train.2$v6),
                     v7=as.factor(rf.train.2$v7),
                     v8=as.factor(rf.train.2$v8),
                     zipcode=rf.train.2$zipcode,
                     price=rf.train.2$price,
                     made_purchase=as.factor(rf.train.2$made_purchase))
rf.label <- rf.train.2[,"made_purchase"]


Answer 3:

This warning appears when the model fails to converge in some cross-validation folds, so the predictions in those folds have zero variance. As a result, metrics such as Rsquared (and RMSE, if the predictions themselves are missing) cannot be calculated and become NA. Sometimes there are parameters you can tune for better convergence; for example, the neuralnet package lets you increase the threshold, which almost always leads to convergence. I am not sure whether rpart offers anything similar.
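To see which folds are affected, one option is to inspect the per-resample metrics stored on the fitted object. The sketch below assumes caret and rpart are installed and uses invented noise data (`toy`), so whether any fold actually ends up with an NA depends on the random draw:

```r
na_folds <- NULL  # stays NULL if caret or rpart is not installed

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  set.seed(1)
  # Hypothetical data: y is pure noise, so a tree may find no useful split
  toy <- data.frame(y = rnorm(60), x1 = rnorm(60), x2 = rnorm(60))

  tr <- suppressWarnings(
    caret::train(y ~ ., data = toy, method = "rpart",
                 trControl = caret::trainControl(method = "cv", number = 5))
  )

  # tr$resample has one row of metrics per fold for the selected tuning value
  na_folds <- tr$resample[!complete.cases(tr$resample), ]
  print(na_folds)
}
```

Any rows printed here are the resamples whose metrics could not be computed.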

Another possible reason is that your training data already contains NAs. The obvious cure is then to remove them before passing the data to train, e.g. train(data = na.omit(training.data)).
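A quick way to check for this in base R, shown on a made-up data frame (the name `training.data` follows the answer; the columns are hypothetical):

```r
# Hypothetical training data with a few missing values
training.data <- data.frame(
  y  = c(1.2, NA, 3.4, 4.1),
  x1 = c(10, 20, NA, 40),
  x2 = c("a", "b", "c", "d")
)

# Count NAs in each column to see where they come from
na_per_column <- colSums(is.na(training.data))
print(na_per_column)

# Keep only the rows with no missing values
clean <- na.omit(training.data)
print(nrow(clean))
```

If every count is zero, missing input data is not the cause and the zero-variance explanation is more likely.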

Hope that enlightens a bit.



Answer 4:

I hit the same warning when fitting a single decision tree to my training data. It was resolved once I removed the NA values from the raw data before splitting it into training and test sets. My guess is that there was a mismatch in the data between the split and the model fit. Steps:

1. Remove the NAs from the raw data.
2. Split the data into training and test sets.
3. Train the model, which will hopefully no longer produce the warning.
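The first two steps could look roughly like this in base R. The data frame `raw`, the 80/20 split ratio, and the seed are assumptions made for illustration:

```r
set.seed(42)

# Hypothetical raw data with missing values in both columns
raw <- data.frame(y = c(rnorm(18), NA, NA),
                  x = c(NA, rnorm(19)))

# Step 1: remove rows containing NAs before doing anything else
clean <- raw[complete.cases(raw), ]

# Step 2: split the cleaned data into training and test sets (80/20)
train_idx <- sample(seq_len(nrow(clean)), size = floor(0.8 * nrow(clean)))
train_set <- clean[train_idx, ]
test_set  <- clean[-train_idx, ]

# Step 3 would then fit the model on train_set only
print(c(train = nrow(train_set), test = nrow(test_set)))
```

Cleaning before the split means both sets are drawn from the same complete rows.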



Tags: r rpart r-caret