Ensemble different datasets in R

Posted 2019-08-22 06:17

Question:

I am trying to combine signals from different models using the example described here. I have different datasets that predict the same output. However, when I combine the model outputs in a caretList and ensemble the signals, I get this error:

Error in check_bestpreds_resamples(modelLibrary) : 
  Component models do not have the same re-sampling strategies

Here is a reproducible example:

library(caret)
library(caretEnsemble)

# Two unrelated data sets with different row counts (200 vs. 400)
df1 <-
  data.frame(x1 = rnorm(200),
             x2 = rnorm(200),
             y = as.factor(sample(c("Jack", "Jill"), 200, replace = TRUE)))

df2 <-
  data.frame(z1 = rnorm(400),
             z2 = rnorm(400),
             y = as.factor(sample(c("Jack", "Jill"), 400, replace = TRUE)))

# Model 1: nnet trained on df1 with cross-validation
check_1 <- train(x = df1[, 1:2], y = df1[, 3],
                 method = "nnet",
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = TRUE))

# Model 2: nnet trained on df2, with centering and scaling
check_2 <- train(x = df2[, 1:2], y = df2[, 3],
                 method = "nnet",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE,
                                          savePredictions = TRUE))

# Combining the two models and ensembling them raises the error shown above
combine <- c(check_1, check_2)
ens <- caretEnsemble(combine)

Answer 1:

First of all, you are trying to combine two models trained on different training data sets. That is not going to work: every model in an ensemble must be trained on the same training set. Because your data sets differ, each trained model has its own set of resamples, hence the error.

Also, building your models without caretList is risky, because there is a good chance of ending up with different resampling strategies. You can control that better by setting the index argument in trainControl (see the vignette).
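As a rough sketch of the index approach mentioned above: the folds can be created once with createFolds and passed to every train call through the same trainControl object. The names folds, shared_ctrl, fit_a and fit_b are made up for this illustration, and tuneLength is reduced to 3 only to keep the run short; this is one possible way to do it, not necessarily what the vignette shows.

```r
library(caret)

set.seed(1324)
df1 <- data.frame(x1 = rnorm(200),
                  x2 = rnorm(200),
                  y  = factor(sample(c("Jack", "Jill"), 200, replace = TRUE)))

# Create the resampling folds once. returnTrain = TRUE returns the row
# indices used for training in each fold, which is what `index` expects.
folds <- createFolds(df1$y, k = 5, returnTrain = TRUE)

# One control object, reused by every model (hypothetical name)
shared_ctrl <- trainControl(method = "cv",
                            index = folds,
                            classProbs = TRUE,
                            savePredictions = "final")

# Two separately trained models that share the same resamples
fit_a <- train(x = df1[, 1:2], y = df1$y,
               method = "nnet", tuneLength = 3, trace = FALSE,
               trControl = shared_ctrl)
fit_b <- train(x = df1[, 1:2], y = df1$y,
               method = "nnet", preProcess = c("center", "scale"),
               tuneLength = 3, trace = FALSE,
               trControl = shared_ctrl)

# Compare the resampling indexes stored in each fitted model
same_index <- identical(fit_a$control$index, fit_b$control$index)
print(same_index)
```

Note that this only works when both models are trained on the same rows; it does not make models trained on df1 and df2 compatible.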

If you use one dataset, you can use the following code:

ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     savePredictions = "final")

set.seed(1324)
# will generate the following warning:
# indexes not defined in trControl.  Attempting to set them ourselves, so 
# each model in the ensemble will have the same resampling indexes.
models <- caretList(x = df1[,1:2],
                    y = df1[,3] ,
                    trControl = ctrl,
                    tuneList = list(
                      check_1 = caretModelSpec(method = "nnet", tuneLength = 10),
                      check_2 = caretModelSpec(method = "nnet", tuneLength = 10, preProcess = c("center", "scale"))
                    )) 


ens <- caretEnsemble(models)


A glm ensemble of 2 base models: nnet, nnet

Ensemble results:
Generalized Linear Model 

200 samples
  2 predictor
  2 classes: 'Jack', 'Jill' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results:

  Accuracy   Kappa     
  0.5249231  0.04164767

Also read this guide on different ensemble strategies.