CARET. Relationship between data splitting and tra

I have carefully read the CARET documentation at: http://caret.r-forge.r-project.org/training.html, the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still a confused about the relationship between two arguments to trainControl:

method 
index

and the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds)

To better frame my questions, let me use the following example from the documentation:

data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB,p = .8, times = 100)
trControl = trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)

My questions are:

If I use createDataPartition (which I assume that does stratified bootstrapping), as in the above example, and I pass the result as index to trainControl do I need to use LGOCV as the method in my call trainControl? If I use another one (e.g. cv) What difference would it make? In my head, once you fix index, you are essentially choosing the type of cross-validation, so I am not sure what role method plays if you use index.
What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't?

3) How can I do stratified k-fold (e.g. 10 fold) cross validation using caret? Would the following do it?

tmp <- createFolds(logBBB, k=10, list=TRUE,  times = 100)
trControl = trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)

标签： r machine-learning cross-validation

1条回答

SAY GOODBYE

2楼-- · 2020-06-09 04:42

If you are not sure what role method plays if you use index, why not to apply all the methods and compare results. It is a blind method of comparaison, but it can give you some intuitions.

  methods <- c('boot', 'boot632', 'cv', 
               'repeatedcv', 'LOOCV', 'LGOCV')

I create my index:

  n <- 100
  tmp <- createDataPartition(logBBB,p = .8, times = n)

I apply trainControl for my list of method, and I remove index from result since it is common to all my methods.

ll <- lapply(methods,function(x)
         trControl = trainControl(method = x, index = tmp))
ll <- sapply(ll,'[<-','index', NULL)

Hence my ll is :

                 [,1]      [,2]      [,3]      [,4]         [,5]      [,6]     
method            "boot"    "boot632" "cv"      "repeatedcv" "LOOCV"   "LGOCV"  
number            25        25        10        10           25        25       
repeats           25        25        1         1            25        25       
verboseIter       FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
returnData        TRUE      TRUE      TRUE      TRUE         TRUE      TRUE     
returnResamp      "final"   "final"   "final"   "final"      "final"   "final"  
savePredictions   FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
p                 0.75      0.75      0.75      0.75         0.75      0.75     
classProbs        FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
summaryFunction   ?         ?         ?         ?            ?         ?        
selectionFunction "best"    "best"    "best"    "best"       "best"    "best"   
preProcOptions    List,3    List,3    List,3    List,3       List,3    List,3   
custom            NULL      NULL      NULL      NULL         NULL      NULL     
timingSamps       0         0         0         0            0         0        
predictionBounds  Logical,2 Logical,2 Logical,2 Logical,2    Logical,2 Logical,2

0人赞添加讨论(0) 举报

CARET. Relationship between data splitting and tra

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间