How to specify covariates in a regression model

Posted 2019-02-25 20:32

Question:

The dataset I would like to analyse looks like this

n <- 4000
tmp <- t(replicate(n, sample(49,6)))
dat <- matrix(0, nrow=n, ncol=49)
colnames(dat) <- paste("p", 1:49, sep="")
dat <- as.data.frame(dat)
dat[, "win.frac"] <- rnorm(n, mean=0.0176504, sd=0.002)
for (i in 1:nrow(dat)) 
  for (j in 1:6) dat[i, paste("p", tmp[i, j], sep="")] <- 1
str(dat)

Now I would like to perform a regression with win.frac as the dependent variable and all other variables (p1, ..., p49) as explanatory variables.

However, with every approach I tried, the coefficient for p49 comes out as NA, with the message "1 not defined because of singularities". I tried

modspec <- paste("win.frac ~", paste("p", 1:49, sep="", collapse=" + "))
fit1 <- lm(as.formula(modspec), data=dat)
fit2 <- lm(win.frac ~ ., data=dat)

Interestingly, the regression works if I use only 48 explanatory variables, whether that set includes p49 (p2, ..., p49) or not (p1, ..., p48), so I do not think the problem is related to p49 itself. I also tried larger values of n, with the same result.

I also tried betareg from the betareg package, since win.frac is restricted between 0 and 1. The regression in this case fails too, with the error message (roughly translated) "error in optim(...): non-finite value of optim specified"

library(betareg)
fit3 <- betareg(as.formula(modspec), data=dat, link="log")

Now I am stuck. How can I perform this regression? Is there a maximum number of variables? Is the problem caused by the explanatory variables all being either 0 or 1?

Any hint is much appreciated!

Answer 1:

I assume that those are dummy-encoded factor variables.

If you do the following, you can see that you get a perfect fit when you model one of your regressors with all the others:

regressormod <- lm(p49 ~ . - win.frac, data = dat)
summary(regressormod)$r.sq
#[1] 1
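
The same dependency can also be seen from the rank of the design matrix. The following is only a minimal sketch: it rebuilds data shaped like the question's so that it runs on its own, and the seed and the helper names `pcols`, `draws` and `X` are assumptions of mine, not part of the original post.

```r
set.seed(42)  # assumed seed, only for reproducibility
n <- 4000
pcols <- paste0("p", 1:49)

# Same structure as in the question: six distinct numbers out of 49 per row
draws <- t(replicate(n, sample(49, 6)))
dat <- matrix(0, nrow = n, ncol = 49, dimnames = list(NULL, pcols))
dat[cbind(rep(seq_len(n), 6), as.vector(draws))] <- 1
dat <- as.data.frame(dat)
dat$win.frac <- rnorm(n, mean = 0.0176504, sd = 0.002)

# Design matrix that lm() builds: intercept plus 49 indicator columns
X <- model.matrix(win.frac ~ ., data = dat)
ncol(X)      # number of columns in the design matrix
qr(X)$rank   # numerical rank; smaller than ncol(X) when columns are dependent

# lm() reports the coefficient it could not estimate as NA
fit <- lm(win.frac ~ ., data = dat)
names(which(is.na(coef(fit))))
```

If the rank is lower than the number of columns, at least one coefficient cannot be estimated, which is what the "not defined because of singularities" message reports.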

It's (mathematically) impossible to include all coefficients from dummy-encoded factor variables in a regression model that also includes an intercept (see this answer on Cross Validated). In your data every row contains exactly six 1s, so p1 + ... + p49 always equals 6, which is a multiple of the intercept column. That's why R excludes one factor level by default if you let it do the dummy encoding for you.
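
Two common ways around this are to drop the intercept or to leave out one indicator. The sketch below is not the only possible approach: it rebuilds data shaped like the question's so that it runs on its own, and the seed, the names `pcols`, `draws`, `fit_noint` and `fit_drop`, and the choice of p49 as the omitted column are all assumptions of mine.

```r
set.seed(1)  # assumed seed, only for reproducibility
n <- 4000
pcols <- paste0("p", 1:49)

# Same structure as in the question: six distinct numbers out of 49 per row
draws <- t(replicate(n, sample(49, 6)))
dat <- matrix(0, nrow = n, ncol = 49, dimnames = list(NULL, pcols))
dat[cbind(rep(seq_len(n), 6), as.vector(draws))] <- 1
dat <- as.data.frame(dat)
dat$win.frac <- rnorm(n, mean = 0.0176504, sd = 0.002)

# Every row sums to 6, so the indicators add up to 6 times the intercept column
table(rowSums(dat[, pcols]))

# Option A: remove the intercept and keep all 49 indicators
fit_noint <- lm(win.frac ~ . - 1, data = dat)
anyNA(coef(fit_noint))

# Option B: keep the intercept and omit one indicator (p49 chosen arbitrarily)
fit_drop <- lm(win.frac ~ . - p49, data = dat)
anyNA(coef(fit_drop))

# Both parametrizations span the same column space, so the fitted values agree
all.equal(fitted(fit_noint), fitted(fit_drop))
```

The two fits describe the same model, but the coefficients are read differently: in option B each coefficient is relative to the omitted indicator, while in option A there is no intercept to compare against.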



Tags: r, regression