I am transitioning from Stata to R. In Stata, if I label factor levels (say, 0 and 1) as M and F, the underlying 0 and 1 remain as they are. Moreover, this is required for dummy-variable linear regression in most software, including Excel and SPSS.
However, I've noticed that R recodes factor levels to 1, 2 instead of 0, 1. I don't know why R does this, although regression internally (and correctly) uses 0 and 1 for the dummy variable. I would appreciate any help.
Here's what I did:
Try #1:
sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(1,0),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 2 1 2 1 1
It seems that the factor levels are now reset to 1 and 2. I believe the 1s and 2s are references to the factor levels here. However, I have lost the original values, i.e. the 0s and 1s.
Try #2:
sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(0,1),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 1 2 1 2 2
Ditto. My 0s and 1s are now 1s and 2s. Quite surprising. Why is this happening?
Try #3:
Now, I wanted to see whether the 1s and 2s have any bad effect on regression. So, here's what I did:
Here's what my data looks like:
> head(data.frame(sassign$total_,sassign$gender))
sassign.total_ sassign.gender
1 357 M
2 138 M
3 172 F
4 272 F
5 149 F
6 113 F
myfit<-lm(sassign$total_ ~ sassign$gender)
myfit$coefficients
(Intercept) sassign$genderM
200.63522 23.00606
So, it turns out that the means are correct. While running the regression, R did use 0 and 1 values as dummies.
I did check other threads on SO, but they mostly talk about how R codes factor variables without explaining why. Stata and SPSS generally require the base category to be coded 0. So, I thought of asking about this.
I'd appreciate any thoughts.
In short, you are just mixing up two different concepts. I will clarify them one by one below.
The meaning of the integers you see in str()
What you see from str() is the internal representation of a factor. A factor is stored internally as an integer vector, where each integer gives the position of the corresponding level in the levels vector. For example:
x <- gl(3, 2, labels = letters[1:3])
x
#[1] a a b b c c
#Levels: a b c
storage.mode(x) ## or `typeof(x)`
#[1] "integer"
str(x)
# Factor w/ 3 levels "a","b","c": 1 1 2 2 3 3
as.integer(x)
#[1] 1 1 2 2 3 3
levels(x)
#[1] "a" "b" "c"
A common use of these positions is to perform as.character(x) in the most efficient way:
levels(x)[x]
#[1] "a" "a" "b" "b" "c" "c"
Your misunderstanding of what a model matrix looks like
It seems to me that you thought a model matrix is obtained by
cbind(1L, as.integer(x))
# [,1] [,2]
#[1,] 1 1
#[2,] 1 1
#[3,] 1 2
#[4,] 1 2
#[5,] 1 3
#[6,] 1 3
which is not true. In this fashion, you would just be treating the factor as a numeric variable.
The model matrix is constructed this way:
xlevels <- levels(x)
cbind(1L, match(x, xlevels[2], nomatch=0), match(x, xlevels[3], nomatch=0))
# [,1] [,2] [,3]
#[1,] 1 0 0
#[2,] 1 0 0
#[3,] 1 1 0
#[4,] 1 1 0
#[5,] 1 0 1
#[6,] 1 0 1
The 1 and 0 indicate "match" / "occurrence" and "no match" / "no occurrence", respectively.
The R routine model.matrix will do this for you efficiently, with easy-to-read column and row names:
model.matrix(~x)
# (Intercept) xb xc
#1 1 0 0
#2 1 0 0
#3 1 1 0
#4 1 1 0
#5 1 0 1
#6 1 0 1
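The xb and xc columns reflect R's default treatment contrasts, under which the first level is the baseline and gets no column of its own. You can inspect the contrast matrix directly; a small sketch, recreating the factor x from above:

```r
## Default treatment contrasts: the first level ("a") is the baseline,
## so only "b" and "c" get indicator columns.
x <- gl(3, 2, labels = letters[1:3])
contrasts(x)
#  b c
#a 0 0
#b 1 0
#c 0 1
```

Each row of this matrix tells model.matrix how to encode one level of x in the dummy columns.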
Writing an R function to produce a model matrix ourselves
We can write a minimal routine mm to generate a model matrix. Though it is much less efficient than model.matrix, it may help you digest the concept better.
mm <- function (x, contrast = TRUE) {
  xlevels <- xlevels <- levels(x)
  ## one 0/1 indicator column per level
  lst <- lapply(xlevels, function (z) match(x, z, nomatch = 0L))
  ## with contrast: intercept column plus indicators for levels 2..k;
  ## without: indicators for all k levels and no intercept
  if (contrast) do.call("cbind", c(list(1L), lst[-1]))
  else do.call("cbind", lst)
}
For example, if we have a factor y with 5 levels:
set.seed(1); y <- factor(sample(1:5, 10, replace=TRUE), labels = letters[1:5])
y
# [1] b b c e b e e d d a
#Levels: a b c d e
str(y)
#Factor w/ 5 levels "a","b","c","d",..: 2 2 3 5 2 5 5 4 4 1
Its model matrices with and without contrast treatment are, respectively:
mm(y, TRUE)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 0 0 0
# [2,] 1 1 0 0 0
# [3,] 1 0 1 0 0
# [4,] 1 0 0 0 1
# [5,] 1 1 0 0 0
# [6,] 1 0 0 0 1
# [7,] 1 0 0 0 1
# [8,] 1 0 0 1 0
# [9,] 1 0 0 1 0
#[10,] 1 0 0 0 0
mm(y, FALSE)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 1 0 0 0
# [2,] 0 1 0 0 0
# [3,] 0 0 1 0 0
# [4,] 0 0 0 0 1
# [5,] 0 1 0 0 0
# [6,] 0 0 0 0 1
# [7,] 0 0 0 0 1
# [8,] 0 0 0 1 0
# [9,] 0 0 0 1 0
#[10,] 1 0 0 0 0
The corresponding model.matrix calls are, respectively:
model.matrix(~ y)
model.matrix(~ y - 1)
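If you want a different baseline level (what Stata calls the base category), there is no need to recode the data: relevel() changes which level is dropped from the model matrix. A small sketch, recreating the factor y from above with explicit values:

```r
## Same values as the y above, written out explicitly
y <- factor(c("b", "b", "c", "e", "b", "e", "e", "d", "d", "a"),
            levels = letters[1:5])
## Make "c" the reference level; the dummy columns now cover a, b, d, e
y2 <- relevel(y, ref = "c")
colnames(model.matrix(~ y2))
#[1] "(Intercept)" "y2a"         "y2b"         "y2d"         "y2e"
```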
R is not Stata, and you will need to unlearn a lot of what you have been taught about dummy-variable construction. R does it behind the scenes for you, and you cannot make R behave exactly like Stata. True, R does put 0s and 1s in the model-matrix column for the non-baseline level, even though the factor's internal codes are 1 and 2. But contrasts are always about differences, and the difference between (0, 1) is the same as the difference between (1, 2).
A data example:
dput(dat)
structure(list(total = c(357L, 138L, 172L, 272L, 149L, 113L),
gender = structure(c(2L, 2L, 1L, 1L, 1L, 1L), .Label = c("F",
"M"), class = "factor")), .Names = c("total", "gender"), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
These two regression models have different model matrices (model matrices are how R constructs its "dummy variables"):
> myfit<-lm(total ~ gender, dat)
>
> myfit$coefficients
(Intercept) genderM
176.5 71.0
> dat$gender=factor(dat$gender, levels=c("M","F") )
> myfit<-lm(total ~ gender, dat)
>
> myfit$coefficients
(Intercept) genderF
247.5 -71.0
> model.matrix(myfit)
(Intercept) genderF
1 1 0
2 1 0
3 1 1
4 1 1
5 1 1
6 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"
> dat$gender=factor(dat$gender, levels=c("F","M") )
> myfit<-lm(total ~ gender, dat)
>
> myfit$coefficients
(Intercept) genderM
176.5 71.0
> model.matrix(myfit)
(Intercept) genderM
1 1 1
2 1 1
3 1 0
4 1 0
5 1 0
6 1 0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"