I use caret in R. My goal is to run separate preProcess PCA on different data frames and then combine the resulting PCA components in a single ridge regression training. However, as the example code below shows, I do not get the same results when applying PCA via preProcess inside the train function versus outside/before it.
- Why do I not get the same results?
- And how do I achieve my main goal in the best way?
#Sample data
s <- c(-0.412440717220306, -0.459911376237869, -0.234769582748413, -0.332282930612564, -0.486973077058792, -0.301480442285538, -0.181094691157341, -0.240918189287186, 0.0962697193026543, -0.119731709361076, -0.389783203601837, -0.217093095183372, -0.302948802709579, -0.406619131565094, 0.247409552335739, -0.406119048595428, 0.0574243739247322, -0.301231145858765, -0.229316398501396, -0.0620433799922466)
t <- c(0.20061232149601, 0.0536709427833557, 0.530373573303223, 0.523406386375427, 0.267315864562988, 0.413556098937988, 0.274257719516754, 0.275401413440704, 0.634453296661377, 0.145272701978683, 0.196711808443069, 0.332845687866211, 0.345706522464752, 0.444085538387299, 0.253269702196121, 0.231440827250481, -0.196317762136459, 0.49691703915596, 0.43754768371582, 0.0106721892952919)
u <- c(-0.565160751342773, 0.377725303173065,-0.273447960615158, -0.338064402341843, -0.59904420375824, -0.780133605003357,-0.508388638496399, -0.226167500019073, -0.257708549499512, -0.349863946437836,-0.443032741546631, -0.36387038230896, -0.455201774835587, -0.137616977095604,0.130770832300186, -0.420618057250977, -0.125859051942825, -0.382272869348526, -0.355217516422272, -0.0601325333118439)
v <- c(-0.45850995182991, -0.0105021595954895, -0.475157409906387, -0.325350821018219, -0.548444092273712, -0.562069535255432, -0.473256289958954, -0.492668628692627, -0.205974608659744, -0.266964733600616, -0.289298176765442, -0.615423858165741, -0.261823982000351, -0.472221553325653, -0.684594392776489, -0.42777806520462, -0.240604877471924, -0.589631199836731, -0.782602787017822, -0.468854814767838)
w <- c(-0.886135756969452, -0.96577262878418,-0.755464434623718, -0.640497982501984, -0.849709093570709, -0.837802410125732, -0.659287571907043, -0.646972358226776, 0.0532735884189606, -0.646163880825043,-0.963890254497528, -0.91286826133728, -1.10484659671783, -0.596551716327667, -0.371927708387375, -0.684276521205902, -0.55376398563385, -0.969008028507233, -0.956810772418976, -0.0229262933135033)
y <- c(9, 26, 30, 15, 25, 30, 30, 35, 35, 30, 21, 30, 9, 33, 31, 34, 29, 35, 25, 31)
#Sample data for procedure 1 and 2
df_test1 <- data.frame(s, t, u, v, w)
df_test2 <- df_test1
#PROCEDURE 1: preProcess (pca) applied WITHIN "train" function
library(caret)
ytrain_df_test <- c(1:nrow(df_test1)) # indices of the observations to be split into the folds
ntrain <- length(ytrain_df_test)
# define folds
cv_folds <- createFolds(ytrain_df_test, k = 10, list = TRUE, returnTrain = TRUE) #, ...)
# define training control
train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final') #, ...)
#adding y
df_test1$y <- y
# train the model
set.seed(1)
model1 <- caret::train(y~., data=df_test1, trControl=train_control, method= 'ridge', preProcess = 'pca')
output1 <- list(model1, model1$pred, summary(model1$pred), cor.test(model1$pred$pred, model1$pred$obs))
names(output1) <- c("Model", "Model_pred", "Summary", "Correlation")
output1
#PROCEDURE 2: preProcess (pca) applied OUTSIDE/BEFORE "train" function
ytrain_df_test <- c(1:nrow(df_test2)) # indices of the observations to be split into the folds
ntrain <- length(ytrain_df_test)
df2 <- preProcess(df_test2, method="pca", thresh = 0.95)
df_test2 <- predict(df2, df_test2)
df_test2$y <- y
df_test2
# define folds
cv_folds <- createFolds(ytrain_df_test, k = 10, list = TRUE, returnTrain = TRUE)
# define training control
train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final')
# train the model
set.seed(1)
model2 <- caret::train(y~., data=df_test2, trControl=train_control, method= 'ridge') #, preProcess = 'pca')
model2
output2 <- list(model2, model2$pred, summary(model2$pred), cor.test(model2$pred$pred, model2$pred$obs))
names(output2) <- c("Model", "Model_pred", "Summary", "Correlation")
output2
1.
When you perform preProcess (pca) within the train function:
- PCA is run on each training fold during CV and that training fold is transformed
- several ridge regression models are estimated (based on the defined hyperparameter search) on each of these transformed training folds
- based on the PCA obtained for each training fold, the corresponding held-out fold is transformed
- all of the fitted models are evaluated on the appropriately transformed held-out folds
When this is finished, the final model is built with the hyperparameters that had the best average performance on the held-out folds:
- PCA is applied to the whole training set and the transformed training data is obtained
- using the pre-chosen hyperparameters, a ridge regression model is built on the transformed training data
When you perform preProcess (pca) before the train function, you cause data leakage, since you are using information from your CV held-out folds to estimate the PCA coordinates. This causes optimistic bias during CV and should be avoided.
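To make the difference concrete, here is a minimal sketch (not part of the question's code, just an illustration) of what train does per fold: preProcess is fitted on the training rows of a fold only and then applied to both the training rows and the held-out rows, so the held-out rows contribute nothing to the PCA rotation. It reuses cv_folds and df_test1 from the question:
# minimal sketch of fold-wise PCA: fit on the training rows only, apply to both parts
pred_cols <- c("s", "t", "u", "v", "w")
for (fold in cv_folds) {                      # cv_folds holds the TRAINING indices (returnTrain = TRUE)
  train_fold   <- df_test1[fold, ]            # training rows of this fold
  heldout_fold <- df_test1[-fold, ]           # held-out rows of this fold
  pp <- preProcess(train_fold[, pred_cols], method = "pca", thresh = 0.95)
  train_pcs   <- predict(pp, train_fold[, pred_cols])    # transformed training rows
  heldout_pcs <- predict(pp, heldout_fold[, pred_cols])  # transformed held-out rows
  # a ridge model would then be fit on train_pcs plus y and evaluated on heldout_pcs
}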
2.
I am not aware of built-in caret functionality that supports this kind of juggling of several data sets.
I believe this can be achieved with mlr3pipelines; this tutorial in particular is handy.
Here is an example of how to split the iris data set into two data sets, apply scaling and PCA to each of them, combine the transformed columns, and fit an rpart model, while tuning the number of retained PCA components as well as one rpart hyperparameter using random search:
packages:
library(mlr3pipelines)
library(visNetwork)
library(mlr3learners)
library(mlr3tuning)
library(mlr3)
library(paradox)
define a pipeop selector named "slct1":
pos1 <- po("select", id = "slct1")
tell it which columns to select:
pos1$param_set$values$selector <- selector_name(c("Sepal.Length", "Sepal.Width"))
tell it what to do after it takes the features:
pos1 %>>%
mlr_pipeops$get("scale", id = "scale1") %>>%
mlr_pipeops$get("pca", id = "pca1") -> pr1
define a pipeop selector named "slct2":
pos2 <- po("select", id = "slct2")
tell it which columns to select:
pos2$param_set$values$selector <- selector_name(c("Petal.Length", "Petal.Width"))
tell it what to do after it takes the features:
pos2 %>>%
mlr_pipeops$get("scale", id = "scale2") %>>%
mlr_pipeops$get("pca", id = "pca2") -> pr2
combine the two outputs:
piper <- gunion(list(pr1, pr2)) %>>%
mlr_pipeops$get("featureunion")
and pipe them into a learner:
graph <- piper %>>%
mlr_pipeops$get("learner",
learner = mlr_learners$get("classif.rpart"))
let's check how it looks:
graph$plot(html = TRUE)
now define how this should be tuned:
glrn <- GraphLearner$new(graph)
10-fold CV:
cv10 <- rsmp("cv", folds = 10)
tune the number of PCA dimensions retained for each data set as well as the complexity parameter of rpart:
ps <- ParamSet$new(list(
ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
ParamInt$new("pca1.rank.", lower = 1, upper = 2),
ParamInt$new("pca2.rank.", lower = 1, upper = 2)
))
define the task and the tuning:
task <- mlr_tasks$get("iris")
instance <- TuningInstance$new(
task = task,
learner = glrn,
resampling = cv10,
measures = msr("classif.ce"),
param_set = ps,
terminator = term("evals", n_evals = 20)
)
Initiate random search:
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
Perhaps this can also be done with tidymodels; however, I have yet to try it.
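For reference, a rough, untested sketch of how the same idea might look with recipes (part of tidymodels), assuming the df_test1 data from the question; step_pca takes a column selection, a number of components, and a prefix, so separate PCA steps can be applied to separate groups of columns before a ridge model is added in a workflow:
library(tidymodels)
# hypothetical grouping of the predictors; PCA is applied to each group separately
rec <- recipe(y ~ ., data = df_test1) %>%
  step_normalize(all_predictors()) %>%
  step_pca(s, t, num_comp = 1, prefix = "grp1_PC") %>%
  step_pca(u, v, w, num_comp = 2, prefix = "grp2_PC")
# inspect the transformed training data; the recipe could then be combined
# with a glmnet/ridge model inside a workflow()
prep(rec) %>% bake(new_data = NULL)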
EDIT: to answer questions in the comments.
In order to fully grasp mlr3 I advise you to read the book as well as tutorials for each of the accessory packages.
In the above example, the number of PCA dimensions retained for each of the data sets was tuned jointly with the cp hyperparameter. This was defined in these lines:
ps <- ParamSet$new(list(
ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
ParamInt$new("pca1.rank.", lower = 1, upper = 2),
ParamInt$new("pca2.rank.", lower = 1, upper = 2)
))
So for pca1, the algorithm could pick 1 or 2 PCs to retain (I set it that way since there are only two features in each data set).
If you do not want to tune the number of dimensions in order to optimize performance then you could define the pipeop
like this:
pos1 %>>%
mlr_pipeops$get("scale", id = "scale1") %>>%
mlr_pipeops$get("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
in that case you should omit it from the parameter set:
ps <- ParamSet$new(list(
ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1)
))
As far as I know, the explained variance threshold cannot currently be tweaked, only the number of retained dimensions for the PCA transformation.
To change the predict type one can define a learner:
learner <- mlr_pipeops$get("learner",
learner = mlr_learners$get("classif.rpart"))
and set the predict type:
learner$learner$predict_type <- "prob"
and then create the graph:
graph <- piper %>>%
learner
To acquire the performance for each hyperparameter combination:
instance$archive(unnest = "params")
To acquire the predictions for each hyperparameter combination:
lapply(as.list(instance$archive(unnest = "params")[,"resample_result"])$resample_result,
function(x) x$predictions())
To acquire the predictions for the best hyperparameter combination:
instance$best()$predictions()
If you would like it in the form of a data frame:
do.call(rbind,
lapply(instance$best()$predictions(),
function(x) data.frame(x$data$tab,
x$data$prob)))
Probably there are some accessory functions that make this easier; I just haven't played around with them enough.