-->

How does Caret generate an OLS model with K-fold c

Posted 2019-08-15 18:16

Question:

Let's say I have some generic dataset for which an OLS regression is the best choice. So, I generate a model with some first-order terms and decide to use Caret in R for my regression coefficient estimates and error estimates.

In caret, this ends up being:

k10_cv = trainControl(method="cv", number=10)
ols_model = train(Y ~ X1 + X2 + X3, data = my_data, trControl = k10_cv, method = "lm")

From there, I can pull out regression information using summary(ols_model) and can also pull some more information by just calling ols_model.

When I print ols_model, are the RMSE/R-squared/MAE values calculated via the typical k-fold CV approach? Also, is the model I see in summary(ols_model) trained on the entire dataset, or is it an average of the models fitted on each of the folds?

If not, in the interest of trading variance for bias, is there a way to acquire an OLS model within Caret that is trained on one fold at a time?

Answer 1:

Here's reproducible data for your example.

library("caret")
my_data <- iris

k10_cv <- trainControl(method="cv", number=10)

set.seed(100)
ols_model <- train(Sepal.Length ~  Sepal.Width + Petal.Length + Petal.Width,
                  data = my_data, trControl = k10_cv, method = "lm")


> ols_model$results
  intercept      RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
1      TRUE 0.3173942 0.8610242 0.2582343 0.03881222 0.04784331 0.02960042

1) The values in ols_model$results above are the means of the per-fold metrics stored in ols_model$resample, shown below:

> (ols_model$resample)
        RMSE  Rsquared       MAE Resample
1  0.3386472 0.8954600 0.2503482   Fold01
2  0.3154519 0.8831588 0.2815940   Fold02
3  0.3167943 0.8904550 0.2441537   Fold03
4  0.2644717 0.9085548 0.2145686   Fold04
5  0.3769947 0.8269794 0.3070733   Fold05
6  0.3720051 0.7792611 0.2746565   Fold06
7  0.3258501 0.8095141 0.2647466   Fold07
8  0.2962375 0.8530810 0.2731445   Fold08
9  0.3059100 0.8351535 0.2611982   Fold09
10 0.2615792 0.9286246 0.2108592   Fold10

That is:

> mean(ols_model$resample$RMSE)==ols_model$results$RMSE
[1] TRUE
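To make the averaging concrete, here is a minimal base-R sketch of the same idea: fit lm on nine folds, score the held-out fold, and average the per-fold metrics. It is not caret's internal code. The fold assignment (fold_id) is a plain random split chosen for illustration, so the numbers will not match the output above exactly, and Rsquared is computed as the squared correlation between observed and predicted values, which I believe is caret's default for regression.

```r
# Manual 10-fold CV for an lm fit, using base R only.
# Assumption: a simple random fold assignment; caret draws its own
# folds, so these numbers will differ slightly from ols_model$results.
my_data <- iris
k <- 10

set.seed(100)
# Assign each row to one of k folds at random (equal sizes here: 150 / 10)
fold_id <- sample(rep(seq_len(k), length.out = nrow(my_data)))

fold_metrics <- t(sapply(seq_len(k), function(i) {
  train_rows <- my_data[fold_id != i, ]
  test_rows  <- my_data[fold_id == i, ]

  # Fit on the other k - 1 folds, then predict the held-out fold
  fit  <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
             data = train_rows)
  pred <- predict(fit, newdata = test_rows)
  obs  <- test_rows$Sepal.Length

  c(RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,        # squared correlation
    MAE      = mean(abs(obs - pred)))
}))

# Mean and standard deviation across folds, analogous to the
# RMSE/Rsquared/MAE and RMSESD/RsquaredSD/MAESD columns
cv_mean <- colMeans(fold_metrics)
cv_sd   <- apply(fold_metrics, 2, sd)
print(round(rbind(mean = cv_mean, sd = cv_sd), 4))
```

Each row of fold_metrics plays the role of one row of ols_model$resample. The ten per-fold fits are only used for scoring and are then discarded; none of them is the model reported by summary(ols_model).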

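2) The final model is trained on the whole training set; it is not an average of the per-fold models. You can check this either by fitting the same formula with lm directly or by specifying method = "none" in trainControl, which skips resampling.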

 coef(lm(Sepal.Length ~  Sepal.Width + Petal.Length + Petal.Width, data = my_data))
 (Intercept)  Sepal.Width Petal.Length  Petal.Width 
   1.8559975    0.6508372    0.7091320   -0.5564827 

These coefficients are identical to those of ols_model$finalModel.
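The comparison can also be scripted rather than checked by eye. The sketch below assumes caret is installed; the object names cv_fit, none_fit and lm_fit are chosen here for illustration. It compares coefficient values only (names are dropped), using all.equal so that tiny floating-point differences do not matter.

```r
library(caret)

my_data <- iris
form <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width

# Cross-validated fit, as in the example above
set.seed(100)
cv_fit <- train(form, data = my_data, method = "lm",
                trControl = trainControl(method = "cv", number = 10))

# No resampling: train() fits the model once on all rows
none_fit <- train(form, data = my_data, method = "lm",
                  trControl = trainControl(method = "none"))

# Plain lm() on the full data set for reference
lm_fit <- lm(form, data = my_data)

# Compare coefficient values of the final models against lm()
same_cv   <- all.equal(unname(coef(cv_fit$finalModel)),
                       unname(coef(lm_fit)))
same_none <- all.equal(unname(coef(none_fit$finalModel)),
                       unname(coef(lm_fit)))
print(same_cv)
print(same_none)
```

If the final model is indeed refit on all rows, both comparisons should report equality. The cross-validation setting then only affects the performance estimates in ols_model$results, not the reported coefficients.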