Is cv.glmnet over-fitting the data by using the lambda sequence from the full data?

Posted 2019-03-31 20:55

Question:

cv.glmnet is widely used in research papers and by companies. While building a cross-validation function analogous to cv.glmnet for glmnet.cr (a related package that implements the lasso for continuation-ratio ordinal regression), I came across this problem in cv.glmnet.

`cv.glmnet` first fits the model:



glmnet.object = glmnet(x, y, weights = weights, offset = offset, 
                     lambda = lambda, ...)

After the glmnet object is created from the complete data, the next step extracts the lambda sequence from that full-data fit:

lambda = glmnet.object$lambda

Next, the function makes sure the number of folds is at least 3:

if (nfolds < 3)
  stop("nfolds must be bigger than 3; nfolds=10 recommended")

A list is created to store the per-fold fits:

outlist = as.list(seq(nfolds))

A for loop then fits the model to each training fold, as cross-validation requires, but note which lambda sequence is passed in:

  for (i in seq(nfolds)) {
    which = foldid == i
    if (is.matrix(y)) 
      y_sub = y[!which, ]
    else y_sub = y[!which]
    if (is.offset) 
      offset_sub = as.matrix(offset)[!which, ]
    else offset_sub = NULL
    # note: lambda here is the sequence computed from the complete data
    outlist[[i]] = glmnet(x[!which, , drop = FALSE], 
                          y_sub, lambda = lambda, offset = offset_sub, 
                          weights = weights[!which], ...)
  }
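
For context, the part of cv.glmnet that follows (not quoted above) predicts on each left-out fold and aggregates the per-fold errors, per lambda, into cvm and cvsd. A rough sketch of that idea, reusing the variables from the excerpt above; this is not the package's actual code:

cv_err = matrix(NA, nfolds, length(lambda))
for (i in seq(nfolds)) {
  which = foldid == i
  # predict on the held-out fold at every lambda in the shared sequence
  pred = predict(outlist[[i]], newx = x[which, , drop = FALSE], s = lambda)
  cv_err[i, ] = colMeans((y[which] - pred)^2)
}
cvm  = colMeans(cv_err)                      # mean cross-validated error per lambda
cvsd = apply(cv_err, 2, sd) / sqrt(nfolds)   # its standard error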

So what happens? After the model is fitted to the complete data, cross-validation is carried out using the lambda sequence obtained from the complete data. Can someone tell me how this can possibly not be over-fitting? In cross-validation we want the model to have no information about the left-out part of the data, yet cv.glmnet cheats on this!

Answer 1:

You're correct that using a cross-validated measure of fit to pick the "best" value of a tuning parameter introduces an optimistic bias into that measure when viewed as an estimate of the out-of-sample performance of the model with that "best" value. Any statistic has a sampling variance. But to talk of over-fitting seems to imply that optimization over the tuning parameter results in a degradation of out-of-sample performance compared to keeping it at a pre-specified value (say zero). That's unusual, in my experience—the optimization is very constrained (over a single parameter) compared to many other methods of feature selection. In any case it's a good idea to validate the whole procedure, including the choice of tuning parameter, on a hold-out set, or with an outer cross-validation loop, or by bootstrapping. See Cross Validation (error generalization) after model selection.
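
A minimal sketch of that kind of outer hold-out validation in R with glmnet (the simulated data, the split fraction, and the variable names are my own assumptions, not part of the original answer):

library(glmnet)

set.seed(1)
x = matrix(rnorm(200 * 20), 200, 20)              # illustrative simulated data
y = drop(x[, 1:3] %*% c(2, -1, 1)) + rnorm(200)

train = sample(nrow(x), 150)                      # outer hold-out split

# the inner cross-validation, including the choice of lambda,
# only ever sees the training rows
cvfit = cv.glmnet(x[train, ], y[train])

# honest estimate of out-of-sample error for the whole procedure
pred = predict(cvfit, newx = x[-train, ], s = "lambda.1se")
mean((y[-train] - drop(pred))^2)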



Answer 2:

No, this is not overfitting.

cv.glmnet() does build the entire solution path for the lambda sequence, but you never pick the last entry in that path. You typically pick lambda == lambda.1se (or lambda.min), as @Fabians said:

lambda == lambda.min: the lambda value at which cvm, the mean cross-validated error, is minimized.

lambda == lambda.1se: the largest lambda value whose cvm is still within one standard error (cvsd) of that minimum. This is the usual choice of optimal lambda.

See the documentation for cv.glmnet() and for coef(..., s='lambda.1se').
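
A minimal usage sketch (the simulated data and object names are illustrative, not from the original post):

library(glmnet)

set.seed(1)
x = matrix(rnorm(100 * 10), 100, 10)
y = drop(x[, 1:2] %*% c(1.5, -2)) + rnorm(100)

cvfit = cv.glmnet(x, y, nfolds = 10)

cvfit$lambda.min    # lambda at which cvm is smallest
cvfit$lambda.1se    # largest lambda with cvm within one SE of that minimum

# coefficients at the selected value -- note this is not the last
# (smallest-lambda) entry of the solution path
coef(cvfit, s = "lambda.1se")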