I am trying to convert a data frame with categorical variables to a model matrix using model.matrix(), but some levels of the variables are missing from the result.
Here's my code:
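# Example data: an id, a binary response, and three four-level factors
df1 <- data.frame(id = 1:200, y = rbinom(200, 1, .5), var1 = factor(rep(c('abc','def','ghi','jkl'), 50)))
df1$var2 <- factor(rep(c('ab c','ghi','jkl','def'), 50))
df1$var3 <- factor(rep(c('abc','ghi','nop','xyz'), 50))
# Strip whitespace from var2 ('ab c' becomes 'abc'), then convert back to a factor
df1$var2 <- as.character(df1$var2)
df1$var2 <- gsub('\\s', '', df1$var2)
df1$var2 <- factor(df1$var2)
# Inspect the levels of each column (NULL for non-factor columns)
sapply(df1, levels)
# Build the model matrix without an intercept
mm1 <- model.matrix(~ 0 + ., df1)
head(mm1)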
Any suggestions? Is this a matrix non-invertibility issue?
The model matrix is perfectly correct. For each factor, the model matrix contains one column fewer than the factor has levels: the information for the remaining level is already contained in the (Intercept) column. You are missing this column because you have specified 0 + in your model formula. Try this:
mm2 <- model.matrix(~., df1)
head(mm2)
You will now see the (Intercept) column, which encodes the "default" information, and the first level of var1 is now also missing from the column names. The (Intercept) represents an observation at the "reference level", which is the combination of the first level of each categorical variable. Any deviation from this reference level is encoded in the var* columns (var1def, var2ghi, and so on), and since your model assumes no interactions between these variables, you get (4 - 1) * 3 = 9 var* columns plus the (Intercept) column (which is replaced by var1abc in your initial model matrix), alongside the numeric id and y columns.
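To make the column bookkeeping concrete, here is a minimal sketch on a smaller toy data frame. The names toy, f1, and f2 are made up for illustration and are not part of the question's data. It shows that removing the intercept does not lose any information: the first factor simply keeps all of its levels in place of the (Intercept) column.

```r
# Toy data (hypothetical names): two factors with four levels each
toy <- data.frame(
  f1 = factor(rep(c("a", "b", "c", "d"), times = 2)),
  f2 = factor(rep(c("w", "x", "y", "z"), each = 2))
)

# Without an intercept, the first factor keeps all four of its levels
mm_no_int <- model.matrix(~ 0 + ., toy)
# With an intercept, every factor drops its first (reference) level
mm_int <- model.matrix(~ ., toy)

colnames(mm_no_int)  # f1a f1b f1c f1d f2x f2y f2z
colnames(mm_int)     # (Intercept) f1b f1c f1d f2x f2y f2z

# Both matrices have the same number of columns
ncol(mm_no_int) == ncol(mm_int)
```

In both cases the first level of f2 ("w") has no column of its own, which is the same effect as the "missing" levels of var2 and var3 in the question.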
Unfortunately I lack the precise terms to describe this. Can anyone help me out?
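For what it is worth, R's default coding for unordered factors is called treatment contrasts (contr.treatment), and the dropped first level is the reference (or baseline) level. If you really do want one indicator column per level of every factor, one possible approach is to pass full identity contrasts through the contrasts.arg argument of model.matrix(). The sketch below uses a made-up toy data frame (toy, f1, f2 are hypothetical names) and assumes every column is a factor; for mixed data you would restrict the lapply() to the factor columns. Note that such a matrix is rank-deficient by construction, so it is usually not suitable for an unpenalized lm() or glm() fit.

```r
# Toy data (hypothetical names): two factors with four levels each
toy <- data.frame(
  f1 = factor(rep(c("a", "b", "c", "d"), times = 2)),
  f2 = factor(rep(c("w", "x", "y", "z"), each = 2))
)

# contrasts(x, contrasts = FALSE) returns an identity matrix over the levels,
# i.e. one indicator column per level instead of treatment contrasts
full_contrasts <- lapply(toy, contrasts, contrasts = FALSE)

# Build the model matrix with full indicator coding for every factor
mm_full <- model.matrix(~ 0 + ., toy, contrasts.arg = full_contrasts)

colnames(mm_full)  # f1a f1b f1c f1d f2w f2x f2y f2z
```

Each row of mm_full now sums to 2 (one indicator per factor), which is exactly the linear dependency that the default coding avoids by dropping a reference level.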