I have a data.frame consisting of numeric and factor variables, as seen below.
testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
                        Second=sample(1:20, 20, replace=TRUE),
                        Third=sample(1:10, 20, replace=TRUE),
                        Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
                        Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"), 4),
                        stringsAsFactors=TRUE)  # R >= 4.0 no longer converts strings to factors by default
I want to build out a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.
model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)
As expected when running lm, this leaves out one level of each factor as the reference level. However, I want to build out a matrix with a dummy/indicator variable for every level of all the factors. I am building this matrix for glmnet, so I am not worried about multicollinearity.
Is there a way to have model.matrix create the dummy for every level of the factor?
I am currently learning the Lasso model and glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for high-dimensional matrices, using model.matrix will kill our time, as suggested by the author of glmnet).
Just sharing that there is a tidy way of coding this that gets the same result as @fabians's and @Gavin's answers. Meanwhile, @asdf123 introduced another package, library('CatEncoders'), as well.
Source: R for Everyone: Advanced Analytics and Graphics (page 273)
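A minimal sketch of the sparse variant mentioned above, assuming the Matrix package (shipped with standard R installations) and a testFrame whose string columns are stored as factors. It is an illustration only, not the code from the book cited above; the names isFactor, contrastList and Xs are made up for the example.

```r
library(Matrix)

set.seed(1)
testFrame <- data.frame(First  = sample(1:10, 20, replace = TRUE),
                        Second = sample(1:20, 20, replace = TRUE),
                        Third  = sample(1:10, 20, replace = TRUE),
                        Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
                        Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
                        stringsAsFactors = TRUE)

# Full (non-reduced) contrast matrix for every factor column
isFactor     <- vapply(testFrame, is.factor, logical(1))
contrastList <- lapply(testFrame[, isFactor, drop = FALSE],
                       contrasts, contrasts = FALSE)

# Same idea as model.matrix(), but the result is stored as a sparse matrix
Xs <- sparse.model.matrix(~ ., data = testFrame, contrasts.arg = contrastList)
dim(Xs)
```

The resulting sparse matrix can be passed to glmnet as the predictor matrix; for a data set this small the dense and sparse versions behave the same, and the sparse form only pays off when the factors have many levels.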
(Trying to redeem myself...) In response to Jared's comment on @Fabians answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrasts matrix from it. For this we can then use lapply() to run contrasts() on each factor in our data set, e.g. for the testFrame example provided:
Which slots nicely into @fabians answer:
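A minimal sketch of the approach this answer describes, assuming testFrame is built with its string columns as factors (contrasts() only applies to factors). The helper names isFactor and contrastList are hypothetical.

```r
set.seed(1)
testFrame <- data.frame(First  = sample(1:10, 20, replace = TRUE),
                        Second = sample(1:20, 20, replace = TRUE),
                        Third  = sample(1:10, 20, replace = TRUE),
                        Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
                        Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
                        stringsAsFactors = TRUE)

# Pick out the factor columns and build one full contrast matrix per factor;
# contrasts = FALSE returns an identity matrix, i.e. one column per level
isFactor     <- vapply(testFrame, is.factor, logical(1))
contrastList <- lapply(testFrame[, isFactor, drop = FALSE],
                       contrasts, contrasts = FALSE)

# The named list plugs straight into contrasts.arg
X <- model.matrix(~ ., data = testFrame, contrasts.arg = contrastList)
colnames(X)
```

The matrix has an intercept, the three numeric columns, and one indicator column for each of the 4 + 5 factor levels.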
caret implemented a nice function, dummyVars, to achieve this with 2 lines:
library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))
Checking the final columns:
The nicest point here is that you get the original data frame plus the dummy variables, with the original factor columns used for the transformation excluded.
More info: http://amunategui.github.io/dummyVar-Walkthrough/
or
should be the most straightforward
Ok. Just reading the above and putting it all together. Suppose you wanted the matrix, e.g. 'X.factors', that multiplies by your coefficient vector to get your linear predictor. There are still a couple of extra steps:
(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)
Then say you get something like this:
We want to get rid of the **'d reference levels of each factor.
You need to reset the contrasts for the factor variables:
or, with a little less typing and without the proper names:
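A minimal sketch of both variants, assuming the factor columns Fourth and Fifth from the question's testFrame are stored as factors. The first call passes full contrast matrices built with contrasts(..., contrasts = FALSE); the second passes plain identity matrices from diag(), which gives the same indicator values but numbered column names. X1 and X2 are names chosen for the example.

```r
set.seed(1)
testFrame <- data.frame(First  = sample(1:10, 20, replace = TRUE),
                        Second = sample(1:20, 20, replace = TRUE),
                        Third  = sample(1:10, 20, replace = TRUE),
                        Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
                        Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
                        stringsAsFactors = TRUE)

# Variant 1: full contrasts, columns named after the levels (FourthAlice, ...)
X1 <- model.matrix(~ Fourth + Fifth, data = testFrame,
                   contrasts.arg = list(
                     Fourth = contrasts(testFrame$Fourth, contrasts = FALSE),
                     Fifth  = contrasts(testFrame$Fifth,  contrasts = FALSE)))

# Variant 2: identity matrices, less typing but columns are named Fourth1, Fourth2, ...
X2 <- model.matrix(~ Fourth + Fifth, data = testFrame,
                   contrasts.arg = list(
                     Fourth = diag(nlevels(testFrame$Fourth)),
                     Fifth  = diag(nlevels(testFrame$Fifth))))

colnames(X1)
colnames(X2)
```

Both matrices keep the intercept column, so they are rank deficient by design; that is acceptable for glmnet, as noted in the question.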