I have carefully read the caret documentation at http://caret.r-forge.r-project.org/training.html and the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still confused about the relationship between two arguments to trainControl, method and index, and about the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds).
To better frame my questions, let me use the following example from the documentation:
library(caret)  # provides BloodBrain, createDataPartition, trainControl, train
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB, p = .8, times = 100)
trControl <- trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
My questions are:
1) If I use createDataPartition (which I assume does stratified bootstrapping), as in the above example, and I pass the result as index to trainControl, do I need to use LGOCV as the method in my call to trainControl? If I use another one (e.g. cv), what difference would it make? In my head, once you fix index, you are essentially choosing the type of cross-validation, so I am not sure what role method plays if you use index.
2) What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't?
3) How can I do stratified k-fold (e.g. 10 fold) cross validation using caret? Would the following do it?
tmp <- createFolds(logBBB, k=10, list=TRUE, times = 100)
trControl = trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)