`cv.glmnet` is widely used in research papers and in industry. While building a similar cross-validation function for `glmnet.cr` (a package that implements the lasso for continuation-ratio ordinal regression), I came across this problem in `cv.glmnet`.
`cv.glmnet` first fits the model:
```r
glmnet.object = glmnet(x, y, weights = weights, offset = offset,
                       lambda = lambda, ...)
```
After the `glmnet` object is created from the complete data, the next step is to extract the `lambda` sequence from that full-data fit:

```r
lambda = glmnet.object$lambda
```
Next, it checks that the number of folds is at least 3:

```r
if (nfolds < 3)
  stop("nfolds must be bigger than 3; nfolds=10 recommended")
```
A list is created to store the cross-validated results:

```r
outlist = as.list(seq(nfolds))
```
A `for` loop then refits the model on each training fold, as cross-validation requires:

```r
for (i in seq(nfolds)) {
  which = foldid == i
  if (is.matrix(y))
    y_sub = y[!which, ]
  else y_sub = y[!which]
  if (is.offset)
    offset_sub = as.matrix(offset)[!which, ]
  else offset_sub = NULL
  # using the lambdas from the complete-data fit
  outlist[[i]] = glmnet(x[!which, , drop = FALSE],
                        y_sub, lambda = lambda, offset = offset_sub,
                        weights = weights[!which], ...)
}
```
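For context, what happens after this loop (a rough sketch, not the actual `cv.glmnet` source, simplified to a gaussian family with squared-error loss): each fold's model is evaluated on its own held-out observations, and the errors are averaged per lambda. This is why every fold must be fit on the same lambda grid.

```r
# Rough sketch of the held-out evaluation step (simplified; the real
# internals also handle weights, offsets, and other families).
predmat = matrix(NA, nrow(x), length(lambda))
for (i in seq(nfolds)) {
  which = foldid == i
  # predictions for fold i come from the model that never saw fold i
  predmat[which, ] = predict(outlist[[i]], x[which, , drop = FALSE])
}
cvm = apply((y - predmat)^2, 2, mean)  # mean CV error at each lambda
```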
So what happens? After fitting the model to the complete data, cross-validation is carried out with the lambdas from the complete data. Can someone tell me how this can possibly not be overfitting? In cross-validation we want the fold models to have no information about the held-out part of the data. But `cv.glmnet` cheats on this!
You're correct that using a cross-validated measure of fit to pick the "best" value of a tuning parameter introduces an optimistic bias into that measure when it is viewed as an estimate of the out-of-sample performance of the model with that "best" value: any statistic has a sampling variance, and optimizing over it partly selects for favourable noise. But to talk of over-fitting seems to imply that optimizing over the tuning parameter degrades out-of-sample performance compared to keeping it at a pre-specified value (say zero). That's unusual, in my experience, because the optimization is very constrained (over a single parameter) compared to many other methods of feature selection.

In any case, it's a good idea to validate the whole procedure, including the choice of tuning parameter, on a hold-out set, with an outer cross-validation loop, or by bootstrapping. See Cross Validation (error generalization) after model selection.
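For concreteness, here is a minimal sketch of such an outer cross-validation loop around `cv.glmnet`; the variables `x` (a numeric predictor matrix) and `y` (a gaussian response) are assumptions, and the fold count and error measure are arbitrary choices:

```r
library(glmnet)

# Assumed inputs: x (numeric predictor matrix), y (numeric response).
set.seed(1)
outer_folds <- sample(rep(1:5, length.out = nrow(x)))
outer_mse <- numeric(5)

for (k in 1:5) {
  train <- outer_folds != k
  # Inner loop: cv.glmnet tunes lambda using only the outer training data
  cvfit <- cv.glmnet(x[train, , drop = FALSE], y[train])
  # Evaluate the tuned model on the untouched outer test fold
  pred <- predict(cvfit, newx = x[!train, , drop = FALSE], s = "lambda.1se")
  outer_mse[k] <- mean((y[!train] - pred)^2)
}
mean(outer_mse)  # honest estimate of the whole procedure's performance
```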
No, this is not overfitting.

`cv.glmnet()` does build the entire solution path for the lambda sequence, but you never pick the last entry in that path. You typically pick `lambda == lambda.1se` (or `lambda.min`), as @Fabians said: see the documentation for `cv.glmnet()` and `coef(..., s = 'lambda.1se')`.
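For example (again assuming a predictor matrix `x` and response `y`):

```r
library(glmnet)

cvfit <- cv.glmnet(x, y)       # cross-validate over the lambda path
cvfit$lambda.min               # lambda with minimum mean CV error
cvfit$lambda.1se               # largest lambda within 1 SE of that minimum
coef(cvfit, s = "lambda.1se")  # sparse coefficients at the chosen lambda
```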