In my dataset I have a number of continuous and dummy variables. For analysis with glmnet, I want the continuous variables to be standardized but not the dummy variables.
I currently do this manually by first building an indicator vector of the columns that contain only the values 0 and 1, and then applying the scale command to all the non-dummy columns. The problem is that this isn't very elegant.
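For concreteness, the manual approach described above might look roughly like the sketch below. The toy matrix and the names x, is_dummy and x_scaled are made up for illustration, and only base R is used.

```r
# Toy data: two continuous columns and one 0/1 dummy (made-up values).
x <- cbind(
  age    = c(23, 35, 47, 51, 62),
  income = c(30, 42, 58, 61, 75),
  female = c(0, 1, 1, 0, 1)
)

# Treat a column as a dummy if every value in it is 0 or 1.
is_dummy <- apply(x, 2, function(col) all(col %in% c(0, 1)))

# Center and scale only the non-dummy columns; dummies are left untouched.
# drop = FALSE keeps a matrix even if there is a single continuous column.
x_scaled <- x
x_scaled[, !is_dummy] <- scale(x[, !is_dummy, drop = FALSE])
```

A matrix prepared this way would then be passed to glmnet with standardize = FALSE, so that the columns are not rescaled a second time.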
But glmnet has a built-in standardize argument. By default, will this standardize the dummies too? If so, is there an elegant way to tell glmnet's standardize argument to skip dummies?
glmnet doesn't know anything about dummy variables, because it doesn't have a formula interface (and hence doesn't touch model.frame and model.matrix). If you want them to be treated specially, you'll have to do it yourself.
In short, yes: this will standardize the dummy variables, but there's a reason for doing so. The
glmnet function takes a matrix as the input for its x parameter, not a data frame, so it cannot single out factor columns the way it could if the parameter were a data.frame. If you take a look at the R function, glmnet codes the standardize parameter internally as an integer, which converts the R boolean to a 0 or 1 to feed to any of the internal FORTRAN functions (elnet, lognet, et al.).
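To illustrate both points, here is a small base R sketch; it does not reproduce glmnet's own source, and the data frame and the name flag are hypothetical. Once a factor has been expanded into a numeric matrix, its dummy column looks like any other numeric column, and a logical flag collapses to 0 or 1 under as.integer.

```r
# Hypothetical data frame with one continuous column and one factor.
df <- data.frame(
  height = c(1.62, 1.75, 1.80, 1.68),
  group  = factor(c("a", "b", "b", "a"))
)

# model.matrix() expands the factor into a 0/1 column; [, -1] drops the intercept.
x <- model.matrix(~ height + group, data = df)[, -1]

# The result is a plain numeric matrix: nothing marks "groupb" as categorical.
is.numeric(x)
colnames(x)

# A logical such as standardize = TRUE becomes an integer 0/1 flag.
flag <- as.integer(TRUE)
```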
If you go even further by examining the FORTRAN code (fixed width - old school!), you'll see the following block:
Take a look at the lines marked 1000 - this is basically applying the standardization formula to the x matrix.
Now, statistically speaking, one does not generally standardize categorical variables, so as to retain the interpretability of the estimated coefficients. However, as Tibshirani points out, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" - so while this causes arbitrary scaling between continuous and categorical variables, it's done for equal penalization treatment.
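As a rough illustration of what standardizing a dummy does, the sketch below centers a made-up 0/1 vector and divides by its standard deviation. The divisor-n form of the standard deviation is an assumption here about the convention in use, and the names d, p, sd_n and d_std are purely illustrative.

```r
# Made-up dummy variable with a share p of ones.
d <- c(0, 1, 1, 0, 1, 1, 1, 0)
p <- mean(d)

# Standard deviation with divisor n (assumed convention); for a 0/1
# variable this equals sqrt(p * (1 - p)).
sd_n <- sqrt(mean((d - p)^2))

# Standardized dummy: still only two distinct values, but now with
# mean 0 and unit variance, so it is penalized on the same scale as
# a standardized continuous column.
d_std <- (d - p) / sd_n
```

The standardized column still carries the same information as the original dummy; only the scale of its coefficient changes, which is the interpretability cost mentioned above.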