`cv.glmnet` is widely used in research papers and in industry. While building a function similar to `cv.glmnet` for `glmnet.cr` (a similar package that implements the lasso for continuation ratio ordinal regression), I came across this problem in `cv.glmnet`.
`cv.glmnet` first fits the model:

    glmnet.object = glmnet(x, y, weights = weights, offset = offset,
        lambda = lambda, ...)
After the `glmnet` object is created with the complete data, the next steps go as follows. The `lambda` sequence from the model fitted to the complete data is extracted:

    lambda = glmnet.object$lambda
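To make concrete what a `lambda` sequence derived from the complete data looks like, here is a rough base R sketch. It assumes a gaussian lasso with standardized predictors, where the largest useful penalty is the smallest one that zeroes every coefficient, `max |x_j' y| / n`. The simulated data, the grid length of 10, and the ratio `1e-4` are assumptions for illustration; this is not glmnet's actual code.

```r
# Sketch: a log-spaced lambda grid derived from the full data (gaussian lasso).
# Approximates the usual construction; it is not glmnet's implementation.
set.seed(1)
n <- 50; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

# Standardize columns with the 1/n variance convention and center y.
col_sd <- apply(x, 2, function(v) sqrt(mean((v - mean(v))^2)))
xs <- scale(x, center = TRUE, scale = col_sd)
yc <- y - mean(y)

# Smallest penalty at which every lasso coefficient is zero.
lambda_max <- max(abs(crossprod(xs, yc))) / n

# Hypothetical grid settings, chosen only for illustration.
nlambda <- 10
lambda_min_ratio <- 1e-4
lambda_grid <- exp(seq(log(lambda_max),
                       log(lambda_max * lambda_min_ratio),
                       length.out = nlambda))
```

In this sketch the data enter the grid only through `lambda_max`; the remaining values are fixed multiples of it.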
Next, the function checks that there are at least 3 folds (the condition is `nfolds < 3`, despite the wording of the error message):

    if (nfolds < 3)
        stop("nfolds must be bigger than 3; nfolds=10 recommended")
A list is created to store the cross-validated fits:

    outlist = as.list(seq(nfolds))
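The loop relies on a vector `foldid` that the excerpt does not define. A common way to build it, and to my knowledge the one `cv.glmnet` uses when `foldid` is not supplied, is to shuffle a repeated sequence of fold labels. The sample size `N = 23` and `nfolds = 5` below are made up for illustration.

```r
# Sketch: assign each of N observations to one of nfolds folds at random.
set.seed(42)
N <- 23        # hypothetical number of observations
nfolds <- 5    # hypothetical number of folds

# Repeat the labels 1..nfolds up to length N, then shuffle them.
foldid <- sample(rep(seq(nfolds), length.out = N))

# Logical mask for the observations held out in fold 1.
which <- foldid == 1

# Fold sizes differ by at most one observation.
fold_sizes <- table(foldid)
```

Observations with `which == TRUE` are left out of the fit for that fold, and `!which` selects the training rows.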
A `for` loop then fits the model on each training subset, as cross-validation prescribes:

    for (i in seq(nfolds)) {
        which = foldid == i
        if (is.matrix(y))
            y_sub = y[!which, ]
        else y_sub = y[!which]
        if (is.offset)
            offset_sub = as.matrix(offset)[!which, ]
        else offset_sub = NULL
        # using the lambdas from the complete data
        outlist[[i]] = glmnet(x[!which, , drop = FALSE],
            y_sub, lambda = lambda, offset = offset_sub,
            weights = weights[!which], ...)
    }
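The structure of this loop can be reproduced in a self-contained way. The sketch below is not glmnet: it swaps in closed-form ridge regression for the lasso so that it runs in base R, and the data, the `lambda` grid, and the helper `ridge_fit` are all hypothetical. What it keeps from the code above is the pattern in question: one `lambda` grid shared by every fold, with coefficients estimated only on the training rows.

```r
# Sketch of K-fold CV over a shared lambda grid.
# Ridge regression (closed form) stands in for the lasso; base R only.
set.seed(7)
n <- 60; p <- 4; nfolds <- 5
x <- matrix(rnorm(n * p), n, p)
y <- drop(x %*% c(2, -1, 0, 0)) + rnorm(n)

lambda <- c(10, 1, 0.1, 0.01)   # hypothetical grid, shared by every fold
foldid <- sample(rep(seq(nfolds), length.out = n))

# Hypothetical helper: ridge coefficients (no intercept) for one lambda.
ridge_fit <- function(x, y, lam) {
  solve(crossprod(x) + lam * nrow(x) * diag(ncol(x)), crossprod(x, y))
}

cv_err <- matrix(NA_real_, nfolds, length(lambda))
for (i in seq(nfolds)) {
  held_out <- foldid == i
  for (j in seq_along(lambda)) {
    # Coefficients are estimated from the training folds only.
    beta <- ridge_fit(x[!held_out, , drop = FALSE], y[!held_out], lambda[j])
    pred <- x[held_out, , drop = FALSE] %*% beta
    cv_err[i, j] <- mean((y[held_out] - pred)^2)
  }
}

cvm <- colMeans(cv_err)                 # mean held-out error per lambda
lambda_best <- lambda[which.min(cvm)]   # grid value with the lowest CV error
```

Here the held-out rows never enter `ridge_fit`; the only thing every fold has in common is the set of candidate `lambda` values.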
So what happens? After the model is fitted to the complete data, cross-validation is run using the lambdas obtained from the complete data. Can someone tell me how this can possibly not be over-fitting? In cross-validation we want the model to have no information about the left-out part of the data, but `cv.glmnet` cheats on this!