The dataset I would like to analyse looks like this:
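# Simulate n rows, each with 6 distinct numbers drawn from 1..49
n <- 4000
tmp <- t(replicate(n, sample(49, 6)))
# One 0/1 indicator column per number: p1, ..., p49
dat <- matrix(0, nrow = n, ncol = 49)
colnames(dat) <- paste("p", 1:49, sep = "")
dat <- as.data.frame(dat)
# Response variable win.frac
dat[, "win.frac"] <- rnorm(n, mean = 0.0176504, sd = 0.002)
# Set the indicators of the six drawn numbers in each row to 1
for (i in 1:nrow(dat)) {
  for (j in 1:6) dat[i, paste("p", tmp[i, j], sep = "")] <- 1
}
str(dat)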
Now I would like to run a regression with win.frac as the dependent variable and all other variables (p1, ..., p49) as explanatory variables.
However, with every approach I tried, the coefficient for p49 comes out as NA, with the message "1 not defined because of singularities". I tried:
modspec <- paste("win.frac ~", paste("p", 1:49, sep="", collapse=" + "))
fit1 <- lm(as.formula(modspec), data=dat)
fit2 <- lm(win.frac ~ ., data=dat)
Interestingly, the regression works if I use only 48 explanatory variables. This set may (p2, ..., p49) or may not (p1, ..., p48) contain p49, so I think the problem is not related to the variable p49 itself. I also tried larger values of n, with the same result.
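One check that may be relevant, sketched here on a smaller sample: because each row is built from six distinct draws, the 49 indicator columns of every row should sum to 6, which would make them linearly dependent on the intercept column. The seed, the smaller n, and the helper names row_totals and X are assumptions made only for this illustration.

```r
# Smaller reproducible version of the data described above
# (seed and n = 200 are arbitrary choices for illustration).
set.seed(1)
n <- 200
tmp <- t(replicate(n, sample(49, 6)))
dat <- matrix(0, nrow = n, ncol = 49)
colnames(dat) <- paste0("p", 1:49)
# Matrix indexing: row i gets a 1 in each of its six drawn columns
dat[cbind(rep(seq_len(n), 6), as.vector(tmp))] <- 1
dat <- as.data.frame(dat)
dat$win.frac <- rnorm(n, mean = 0.0176504, sd = 0.002)

# Number of 1s per row; constant if every row has six distinct draws
row_totals <- rowSums(dat[, paste0("p", 1:49)])
print(unique(row_totals))

# Design matrix of win.frac ~ . : intercept plus p1, ..., p49
X <- model.matrix(win.frac ~ ., data = dat)
print(ncol(X))      # number of columns in the design matrix
print(qr(X)$rank)   # numerical rank; lower than ncol(X) if columns are dependent
```

If the rank is lower than the number of columns, lm() cannot estimate all coefficients at once and reports the last aliased one as NA, which would match what I see for p49.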
I also tried betareg from the betareg package, since win.frac is restricted to values between 0 and 1. The regression fails in this case too, with the error message (roughly translated) "error in optim(...): non-finite value of optim specified":
library(betareg)
fit3 <- betareg(as.formula(modspec), data=dat, link="log")
Now I am stuck. How can I perform this regression? Is there a maximum number of variables? Is the problem due to the fact that the explanatory variables are all either 0 or 1?
Any hint is much appreciated!