I am running Ridge regression using the glmnet R package. I noticed that the coefficients I obtain from the glmnet::glmnet function differ from those I get by computing the coefficients by definition (using the same lambda value). Could somebody explain why?
The data (both the response Y and the design matrix X) are scaled.
library(MASS)
library(glmnet)
# Data dimensions
p.tmp <- 100
n.tmp <- 100
# Data objects
set.seed(1)
X <- scale(mvrnorm(n.tmp, mu = rep(0, p.tmp), Sigma = diag(p.tmp)))
beta <- rep(0, p.tmp)
beta[sample(1:p.tmp, 10, replace = FALSE)] <- 10
Y.true <- X %*% beta
Y <- scale(Y.true + matrix(rnorm(n.tmp))) # Y.true + Gaussian noise
# Run glmnet
ridge.fit.cv <- cv.glmnet(X, Y, alpha = 0)
ridge.fit.lambda <- ridge.fit.cv$lambda.1se
# Extract coefficient values for lambda.1se (without intercept)
ridge.coef <- (coef(ridge.fit.cv, s = ridge.fit.lambda))[2:(p.tmp+1)]
# Get coefficients "by definition"
ridge.coef.DEF <- solve(t(X) %*% X + ridge.fit.lambda * diag(p.tmp)) %*% t(X) %*% Y
# Plot estimates
plot(ridge.coef, type = "l", ylim = range(c(ridge.coef, ridge.coef.DEF)),
main = "black: Ridge `glmnet`\nred: Ridge by definition")
lines(ridge.coef.DEF, col = "red")
If you read ?glmnet, you will see that the penalized objective function for a Gaussian response is

$$\min_{\beta_0,\,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^T\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right].$$

In case the ridge penalty $\tfrac{1}{2}\|\beta\|_2^2$ is used (alpha = 0) and there is no intercept, we have

$$\frac{1}{2n}\|y - X\beta\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2,$$

which is proportional to

$$\|y - X\beta\|_2^2 + n\lambda\|\beta\|_2^2.$$

This is different from what we usually see in textbooks on ridge regression:

$$\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2.$$

The formula you wrote,

$$\hat\beta = (X^TX + \lambda I)^{-1}X^Ty,$$

is for the textbook result; for glmnet we should expect

$$\hat\beta = (X^TX + n\lambda I)^{-1}X^Ty.$$

So the textbook uses penalized least squares, but glmnet uses penalized mean squared error.

Note that I did not use your original code with t(), "%*%" and solve(A) %*% b; using crossprod and solve(A, b) is more efficient! See the Follow-up section at the end.

Now let's make a new comparison:
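A minimal sketch of that comparison (assuming the X, Y, n.tmp and p.tmp defined above; the fit is redone with intercept = FALSE, and the by-definition estimate uses n.tmp * lambda together with crossprod and solve(A, b)):

# Refit without an intercept; the by-definition estimate must use n * lambda
ridge.fit.cv <- cv.glmnet(X, Y, alpha = 0, intercept = FALSE)
ridge.fit.lambda <- ridge.fit.cv$lambda.1se
# glmnet coefficients (drop the intercept entry)
ridge.coef <- coef(ridge.fit.cv, s = ridge.fit.lambda)[-1]
# "By definition", with n.tmp * lambda and the more efficient crossprod / solve(A, b)
ridge.coef.DEF <- drop(solve(crossprod(X) + n.tmp * ridge.fit.lambda * diag(p.tmp),
                             crossprod(X, Y)))
# Overlay the two estimates
plot(ridge.coef, type = "l", ylim = range(c(ridge.coef, ridge.coef.DEF)),
     main = "black: Ridge `glmnet`\nred: Ridge by definition")
lines(ridge.coef.DEF, col = "red")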
Note that I have set intercept = FALSE when calling cv.glmnet (or glmnet). This is more of a conceptual point than something that matters in practice. Conceptually, our textbook computation has no intercept, so we want to drop the intercept when using glmnet. But practically, since your X and Y are standardized, the theoretical estimate of the intercept is 0. Even with intercept = TRUE (the glmnet default), you can check that the estimate of the intercept is ~1e-17 (numerically 0), hence the estimates of the other coefficients are not notably affected. The other answer is just showing this.
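As a quick check of that claim (an illustrative sketch; fit.with.int is just a throwaway name):

# Keep the default intercept = TRUE and inspect the fitted intercept:
# with standardized X and Y it comes out on the order of 1e-17, i.e. numerically 0
fit.with.int <- cv.glmnet(X, Y, alpha = 0, intercept = TRUE)
coef(fit.with.int, s = fit.with.int$lambda.1se)[1]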
Follow-up

t(X) %*% Y will first take the transpose X1 <- t(X), then do X1 %*% Y, while crossprod(X, Y) will not form the transpose at all. "%*%" is a wrapper for DGEMM for the case op(A) = A, op(B) = B, while crossprod is a wrapper for op(A) = A', op(B) = B. Similarly, tcrossprod is for op(A) = A, op(B) = B'.

A major use of crossprod(X) is for t(X) %*% X; similarly, tcrossprod(X) is for X %*% t(X), in which case DSYRK instead of DGEMM is called. You can read the first section of "Why the built-in lm function is so slow in R?" for the reason and a benchmark.
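For instance (a small illustrative check, not part of the original comparison; the matrix dimensions are arbitrary):

# crossprod/tcrossprod give the same results as the explicit t() versions,
# but avoid forming the transpose and can call DSYRK for the symmetric case
A <- matrix(rnorm(2000 * 500), 2000, 500)
b <- rnorm(2000)
all.equal(drop(crossprod(A, b)), drop(t(A) %*% b))  # TRUE
all.equal(crossprod(A), t(A) %*% A)                 # TRUE
system.time(t(A) %*% A)     # DGEMM, after explicitly transposing A
system.time(crossprod(A))   # DSYRK, typically noticeably faster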
Be aware that if X is not a square matrix, crossprod(X) and tcrossprod(X) are not equally fast, as they involve different amounts of floating-point operations; for this you may read the side notice of "Any faster R function than “tcrossprod” for symmetric dense matrix multiplication?".

Regarding solve(A, b) versus solve(A) %*% b, please read the first section of "How to compute diag(X %*% solve(A) %*% t(X)) efficiently without taking matrix inverse?".

Adding on top of Zheyuan's interesting post, I did some more experiments and found that we can get the same results with an intercept as well, as follows:
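A minimal sketch of that experiment (assuming the objects from the question; the intercept is left unpenalized, the closed-form slopes are computed from centered data with the same n * lambda scaling, and small discrepancies from glmnet's internal standardization are possible):

# Fit with an intercept (the glmnet default); glmnet does not penalize it
fit.int <- cv.glmnet(X, Y, alpha = 0)
lam <- fit.int$lambda.1se
coef.glmnet <- coef(fit.int, s = lam)[-1]
# Closed form with an unpenalized intercept: center X and Y, solve for the
# slopes with n.tmp * lambda, then recover the intercept from the means
Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- Y - mean(Y)
coef.DEF <- drop(solve(crossprod(Xc) + n.tmp * lam * diag(p.tmp), crossprod(Xc, Yc)))
intercept.DEF <- mean(Y) - drop(colMeans(X) %*% coef.DEF)  # ~0 for standardized data
# The two sets of slope estimates should lie on top of each other
plot(coef.glmnet, type = "l", ylim = range(c(coef.glmnet, coef.DEF)),
     main = "black: Ridge `glmnet` (with intercept)\nred: Ridge by definition")
lines(coef.DEF, col = "red")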