Caret: There were missing values in resampled performance measures

Posted 2020-06-29 04:49

I am running caret's neural network on the Bike Sharing dataset and I get the following error message:

In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.

I am not sure what the problem is. Can anyone help please?

The dataset is from: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Here is the code:

library(caret)
library(bestNormalize)

data_hour = read.csv("hour.csv")

# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)    
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]  

# View structure of data
str(dataset)

# 'data.frame': 13905 obs. of  14 variables:
# $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
# $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
# $ hr        : int  1 2 3 5 8 10 11 12 14 15 ...
# $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
# $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int  1 1 1 2 1 1 1 1 2 2 ...
# $ temp      : num  0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp     : num  0.273 0.273 0.288 0.258 0.288 ...
# $ hum       : num  0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num  0 0 0 0.0896 0 ...
# $ casual    : int  8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int  32 27 10 1 7 24 30 55 71 70 ...
# $ cnt       : int  40 32 13 1 8 36 56 84 106 110 ...

## Transform numeric data to Gaussian
dataset_selected = dataset[,c(-13,-14)]                                                
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] + 0.1)) }

# View transformed dataset
str(dataset_selected)

# 'data.frame': 13905 obs. of  12 variables:
# $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
# $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
# $ hr        : int  1 2 3 5 8 10 11 12 14 15 ...
# $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
# $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int  1 1 1 2 1 1 1 1 2 2 ...
# $ temp      : num  -1.47 -1.47 -1.35 -1.35 -1.35 ...
# $ atemp     : num  -1.18 -1.18 -1.09 -1.27 -1.09 ...
# $ hum       : num  0.899 0.899 0.637 0.637 0.637 ...
# $ windspeed : num  -1.8 -1.8 -1.8 -0.787 -1.8 ...
# $ casual    : num  -0.361 -0.588 -0.81 -1.867 -1.208 ...


# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats =3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~., data = dataset_selected, method = 'nnet', metric = metric, trControl = control, trace = FALSE)

Thanks for your help!

Tags: r r-caret nnet
3 answers
Answer 1 (by missuse) -- 2020-06-29 05:12

phiver's comment is spot on; however, I would still like to provide a more detailed answer for this concrete example.

To investigate what is going on in more detail, add the argument savePredictions = "all" to trainControl (the call below also sets returnResamp = "all" so the per-resample metrics are kept):

control = trainControl(method = 'repeatedcv',
                       number = 10,
                       repeats = 3,
                       returnResamp = "all",
                       savePredictions = "all")

metric = 'RMSE'
set.seed(3)
fit = train(casual ~.,
            data = dataset_selected,
            method = 'nnet',
            metric = metric,
            trControl = control,
            trace = FALSE,
            form = "traditional")

Now, when running:

fit$results
#output
  size decay      RMSE  Rsquared       MAE      RMSESD RsquaredSD       MAESD
1    1 0e+00 0.9999205       NaN 0.8213177 0.009655872         NA 0.007919575
2    1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3    1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4    3 0e+00 0.9999205       NaN 0.8213177 0.009655872         NA 0.007919575
5    3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6    3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7    5 0e+00 0.9999205       NaN 0.8213177 0.009655872         NA 0.007919575
8    5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9    5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535

we notice that the problem occurs when decay = 0.

Let's filter the observations and predictions for decay = 0:

library(tidyverse)
fit$pred %>%
  filter(decay == 0) -> for_r2

var(for_r2$pred)
#output 
0

We can see that all of the predictions for decay == 0 are identical (they have zero variance); the model exclusively predicts 0:

unique(for_r2$pred)
#output 
0

So when the summary function tries to compute R squared:

caret::R2(for_r2$pred, for_r2$obs)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
  the standard deviation is zero
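
To see why this returns NA, here is a minimal standalone reproduction with made-up obs and pred vectors (not taken from the fit above): a constant prediction vector has zero standard deviation, so the correlation underlying Rsquared is undefined.

obs  <- c(0.2, -0.5, 1.3, 0.8)   # made-up observed values
pred <- rep(0, length(obs))      # constant predictions, zero variance
cor(obs, pred)                   # NA, with the "standard deviation is zero" warning
caret::R2(pred, obs)             # NA -- the missing Rsquared in the resample results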
Answer 2 -- 2020-06-29 05:17

The answer by @missuse already gives very good insight into why this error happens.

So I just want to add some straightforward ways to get rid of this error.

If the predictions have zero variance in some cross-validation folds, the model did not converge. In such cases, you can try the neuralnet package, which offers two parameters you can tune:

  1. threshold : default value = 0.01. Set it to 0.3 and then try lower values 0.2, 0.1, 0.05.
  2. stepmax : default value = 1e+05. Set it to 1e+08 and then try lower values 1e+07, 1e+06.

In most cases, it is sufficient to change the threshold parameter like this:

  # formula1 and training.set are placeholders for your own model formula
  # (e.g. casual ~ .) and training data frame
  model.nn <- caret::train(formula1,
                           method = "neuralnet",
                           data = training.set,
                           # apply preProcess within cross-validation folds
                           preProcess = c("center", "scale"),
                           trControl = trainControl(method = "repeatedcv",
                                                    number = 10,
                                                    repeats = 3),
                           threshold = 0.3    # passed on to neuralnet()
  )
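
If raising threshold alone does not help, stepmax can be passed the same way, since caret hands these extra arguments on to neuralnet(). A sketch of that variant applied to the question's dataset_selected for concreteness (the object name fit_nn and the specific values 0.2 and 1e+07 are only illustrative):

  set.seed(3)
  fit_nn <- caret::train(casual ~ .,
                         data = dataset_selected,
                         method = "neuralnet",
                         preProcess = c("center", "scale"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 3),
                         threshold = 0.2,   # looser stopping criterion than the 0.01 default
                         stepmax = 1e+07)   # allow more training steps before giving up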
Answer 3 -- 2020-06-29 05:21

Answer by @topepo (main developer of the caret package). See the detailed GitHub thread here.

It looks like it happens when you have one hidden unit and almost no regularization. What is happening is that the model is predicting a value very close to a constant (so the RMSE is a little worse than the standard deviation of the outcome):

> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
      RMSE Rsquared      MAE size decay     Resample
1 8.414010       NA 6.704311    1 0e+00 Fold04.Rep01
2 8.421244       NA 6.844363    1 0e+00 Fold01.Rep03
3 7.855925       NA 6.372947    1 1e-04 Fold10.Rep07
4 7.963816       NA 6.428947    1 0e+00 Fold07.Rep09
5 8.492898       NA 6.901842    1 0e+00 Fold09.Rep09
6 7.892527       NA 6.479474    1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888

So it's nothing to really worry about; just some parameters that do very poorly.
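
That claim is easy to sanity-check with a made-up outcome vector (y below is hypothetical, not the bike-sharing data): a model that always predicts the outcome's mean has an RMSE of essentially the outcome's standard deviation, so a near-constant prediction can only do slightly worse.

set.seed(1)
y <- rnorm(1000, mean = 20, sd = 8)        # hypothetical outcome
rmse_const <- sqrt(mean((y - mean(y))^2))  # RMSE of a constant (mean) prediction
c(rmse_const = rmse_const, sd_y = sd(y))   # nearly identical values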
