Let's say I have some generic dataset for which an OLS regression is the best choice. So, I generate a model with some first-order terms and decide to use Caret in R for my regression coefficient estimates and error estimates.
In caret, this ends up being:
k10_cv = trainControl(method="cv", number=10)
ols_model = train(Y ~ X1 + X2 + X3, data = my_data, trControl = k10_cv, method = "lm")
From there, I can pull out regression information using summary(ols_model), and can also pull some more information by just calling ols_model.
When I just look at ols_model, is the RMSE/R-squared/MAE being calculated via the typical k-fold CV approach? Also, when the model I see in summary(ols_model) is generated, is this model trained on the entire dataset, or is it an average of models generated across each of the folds?
If not, in the interest of trading variance for bias, is there a way to acquire an OLS model within Caret that is trained on one fold at a time?
Here's reproducible data for your example.
library("caret")
my_data <- iris
k10_cv <- trainControl(method="cv", number=10)
set.seed(100)
ols_model <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = my_data, trControl = k10_cv, method = "lm")
> ols_model$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 0.3173942 0.8610242 0.2582343 0.03881222 0.04784331 0.02960042
1) Each metric in ols_model$results above is the mean of that metric across the individual resamples (folds) below:
> (ols_model$resample)
RMSE Rsquared MAE Resample
1 0.3386472 0.8954600 0.2503482 Fold01
2 0.3154519 0.8831588 0.2815940 Fold02
3 0.3167943 0.8904550 0.2441537 Fold03
4 0.2644717 0.9085548 0.2145686 Fold04
5 0.3769947 0.8269794 0.3070733 Fold05
6 0.3720051 0.7792611 0.2746565 Fold06
7 0.3258501 0.8095141 0.2647466 Fold07
8 0.2962375 0.8530810 0.2731445 Fold08
9 0.3059100 0.8351535 0.2611982 Fold09
10 0.2615792 0.9286246 0.2108592 Fold10
I.e.
> mean(ols_model$resample$RMSE)==ols_model$results$RMSE
[1] TRUE
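To make the averaging concrete, here is a minimal base-R sketch of the same idea: fit lm on nine folds, score the held-out fold, and average the ten scores. The fold assignment (fold_id) is my own simple random split, not caret's internal one, so the numbers will be close to, but not the same as, those in ols_model$resample.

```r
# Manual 10-fold CV for the same formula, using base R only.
# Assumption: a plain random fold assignment; caret's own folds differ.
set.seed(100)
my_data <- iris
k <- 10

# Assign every row to one of k folds of (nearly) equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(my_data)))

fold_rmse <- sapply(seq_len(k), function(i) {
  train_rows <- my_data[fold_id != i, ]   # the other k - 1 folds
  test_rows  <- my_data[fold_id == i, ]   # the held-out fold
  fit  <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
             data = train_rows)
  pred <- predict(fit, newdata = test_rows)
  sqrt(mean((test_rows$Sepal.Length - pred)^2))  # RMSE on the held-out fold
})

cv_rmse    <- mean(fold_rmse)  # plays the role of ols_model$results$RMSE
cv_rmse_sd <- sd(fold_rmse)    # plays the role of ols_model$results$RMSESD
print(fold_rmse)
print(c(RMSE = cv_rmse, RMSESD = cv_rmse_sd))
```

Each of the ten lm fits here exists only to produce a held-out error; none of them is kept as the final model.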
2) The final model is trained on the whole training set, not averaged across folds. You can check this either by fitting the same formula with lm directly, or by specifying method = "none" in trainControl.
coef(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = my_data))
(Intercept) Sepal.Width Petal.Length Petal.Width
1.8559975 0.6508372 0.7091320 -0.5564827
These are identical to the coefficients in ols_model$finalModel.
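The method = "none" check can be sketched the same way. This assumes caret is installed; fit_none and fit_lm are just names I picked for the illustration. With method = "none", train skips resampling and fits a single model to all rows, so its coefficients should agree with a plain lm fit.

```r
library(caret)

my_data <- iris

# No resampling: train() fits one model on the full data set.
fit_none <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                  data = my_data,
                  method = "lm",
                  trControl = trainControl(method = "none"))

# Reference fit with lm() on the same data.
fit_lm <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
             data = my_data)

# Compare coefficient values only (names dropped), up to numerical tolerance.
same_coefs <- isTRUE(all.equal(unname(coef(fit_none$finalModel)),
                               unname(coef(fit_lm))))
print(same_coefs)
```

Running the same comparison against ols_model$finalModel from the 10-fold run should give the same answer, since the resampling only affects the reported error estimates.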