`cv.glmnet` is widely used in research papers and in industry. While building a similar cross-validation function for `glmnet.cr` (a package that implements the lasso for continuation-ratio ordinal regression), I came across this problem in `cv.glmnet`.
`cv.glmnet` first fits the model:
```r
glmnet.object = glmnet(x, y, weights = weights, offset = offset,
                       lambda = lambda, ...)
```
After the `glmnet` object is created from the complete data, the next step is to extract the `lambda` sequence from that full-data fit:

```r
lambda = glmnet.object$lambda
```
Next, it checks that the number of folds is at least 3:

```r
if (nfolds < 3)
  stop("nfolds must be bigger than 3; nfolds=10 recommended")
```
A list is created to store the cross-validated results:

```r
outlist = as.list(seq(nfolds))
```
A `for` loop then refits the model on each training fold, as cross-validation requires:

```r
for (i in seq(nfolds)) {
  which = foldid == i
  if (is.matrix(y))
    y_sub = y[!which, ]
  else y_sub = y[!which]
  if (is.offset)
    offset_sub = as.matrix(offset)[!which, ]
  else offset_sub = NULL
  # using the lambdas from the complete-data fit
  outlist[[i]] = glmnet(x[!which, , drop = FALSE],
                        y_sub, lambda = lambda, offset = offset_sub,
                        weights = weights[!which], ...)
}
```
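For context, what happens after this loop (a rough sketch, not the actual `cv.glmnet` source, simplified to a gaussian family with squared-error loss): each fold's model is evaluated on its own held-out observations, and the errors are averaged per lambda. This is why every fold must be fit on the same lambda grid.

```r
# Rough sketch of the held-out evaluation step (simplified; the real
# internals also handle weights, offsets, and other families).
predmat = matrix(NA, nrow(x), length(lambda))
for (i in seq(nfolds)) {
  which = foldid == i
  # predictions for fold i come from the model that never saw fold i
  predmat[which, ] = predict(outlist[[i]], x[which, , drop = FALSE])
}
cvm = apply((y - predmat)^2, 2, mean)  # mean CV error at each lambda
```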
So what happens? After fitting the model to the complete data, cross-validation is carried out with the lambdas from the complete data. Can someone tell me how this can possibly not be overfitting? In cross-validation we want the fold models to have no information about the held-out part of the data. But `cv.glmnet` cheats on this!
You're correct that using a cross-validated measure of fit to pick the "best" value of a tuning parameter introduces an optimistic bias into that measure when it is viewed as an estimate of the out-of-sample performance of the model with that "best" value: any statistic has a sampling variance, and optimizing over it partly selects for favourable noise. But to talk of over-fitting seems to imply that optimizing over the tuning parameter degrades out-of-sample performance compared to keeping it at a pre-specified value (say zero). That's unusual, in my experience, because the optimization is very constrained (over a single parameter) compared to many other methods of feature selection.

In any case, it's a good idea to validate the whole procedure, including the choice of tuning parameter, on a hold-out set, with an outer cross-validation loop, or by bootstrapping. See Cross Validation (error generalization) after model selection.
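For concreteness, here is a minimal sketch of such an outer cross-validation loop around `cv.glmnet`; the variables `x` (a numeric predictor matrix) and `y` (a gaussian response) are assumptions, and the fold count and error measure are arbitrary choices:

```r
library(glmnet)

# Assumed inputs: x (numeric predictor matrix), y (numeric response).
set.seed(1)
outer_folds <- sample(rep(1:5, length.out = nrow(x)))
outer_mse <- numeric(5)

for (k in 1:5) {
  train <- outer_folds != k
  # Inner loop: cv.glmnet tunes lambda using only the outer training data
  cvfit <- cv.glmnet(x[train, , drop = FALSE], y[train])
  # Evaluate the tuned model on the untouched outer test fold
  pred <- predict(cvfit, newx = x[!train, , drop = FALSE], s = "lambda.1se")
  outer_mse[k] <- mean((y[!train] - pred)^2)
}
mean(outer_mse)  # honest estimate of the whole procedure's performance
```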
No, this is not overfitting.

`cv.glmnet()` does build the entire solution path for the lambda sequence, but you never pick the last entry in that path. You typically pick `lambda == lambda.1se` (or `lambda.min`), as @Fabians said: see the documentation for `cv.glmnet()` and `coef(..., s = 'lambda.1se')`.
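For example (again assuming a predictor matrix `x` and response `y`):

```r
library(glmnet)

cvfit <- cv.glmnet(x, y)       # cross-validate over the lambda path
cvfit$lambda.min               # lambda with minimum mean CV error
cvfit$lambda.1se               # largest lambda within 1 SE of that minimum
coef(cvfit, s = "lambda.1se")  # sparse coefficients at the chosen lambda
```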