I have a data frame of 96 observations (patients) and 1098 variables (genes). The response is binary (Y and N) and the predictors are numeric. I am trying to perform leave-one-out cross validation, but my interest is not standard error, but the p-values for each variable from each of the 95 logistic regression models created from LOOCV. These are my attempts thus far:
#Data frame 96 observations 1098 variables
DF2
fit <- list()
for (i in 1:96){
df <- DF2[-i,]
fit[[i]] <- glm (response ~., data= df, family= "binomial")
}
model_pvalues <- data.frame(model = character(), p_value = numeric())
This outputs fit as a large list with 16 elements and a list of 30: $coefficients, $residuals, $fitted.values....
Attempt 1:
for (i in length(fit)){
model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]])))
}
This output into "model_pvalues" 95 observations (Intercept and 94 variables) and 4 variables: Estimate, Std. Error, z value, Pr(>|z|). However what I am really trying to get is the p-value for all 1097 variables, for 95 models constructed by leave one out cross validation.
Attempt 2:
for (i in length(fit)){
model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]]))[4])
}
When I ran this I get out one number (not sure from where, assuming a beta) for one variable.
Attempt 3:
for (i in 1:96){
df <- DF2[-i,]
fit[[i]] <- glm (response ~., data= df, family= "binomial")
model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]])))
}
When I run this I get out a data frame of 1520 observations of 4 variables: Estimate, Std. Error, z value, Pr(>|z|). The observations begin with (Intercept) followed by 82 variables. After that it repeats this pattern with (Intercept1) and the same 82 variables, up until (Intercept15).
So my end goal is to create 95 models via LOOCV and to get the p-values for all 1097 variables used in all models. Any help would be very much appreciated!
Edit: example data (real DF 96 observations for 1098 variables)
Response X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
P1 N 1 1 1 0 1 0 1 0 2 2
P2 N 2 1 1 0 2 2 1 2 2 2
P3 N 2 1 2 1 1 0 1 1 0 1
P4 Y 1 1 2 0 1 0 0 1 1 1
P5 N 2 2 1 1 1 0 0 0 1 1
P6 N 2 1 2 1 1 0 0 0 2 1
P7 Y 2 1 1 0 2 0 0 0 2 0
P8 Y 2 1 1 0 2 0 0 1 0 2
P9 N 1 1 1 0 2 0 0 0 1 0
P10 N 2 1 2 1 1 0 1 0 0 2
For
n
observations (96 for your real data, 10 in the example data) andp
variables (1098 for your real data, 10 in the example data), the code below should extract ap
row byn
column matrix of p-values. I feel obliged to warn you that trying to fit ann<<p
case (very few observations relative to the number of parameters) is likely to have extremely poor statistical properties, and maybe even be impossible, unless you use a technique like penalized regression ... this is also probably the reason why so many of your parameters are missing from the estimates (i.e. you're only getting 94 out of a possible 1097 variables) - especially since your expression patterns are simple (only 0, 1, or 2), a large number of the parameters are collinear and can't be jointly estimated (you should have seen a lot ofNA
s in your original model fit, too).Get example data:
Fit models
In this case we have to be a little bit careful extracting the p-values because, due to collinearity, some of them are missing - R leaves an
NA
in the coefficient vector (coef()
) for non-estimated parameters, but doesn't similarly fill in rows of the coefficient table in the summary.Of course, for the toy example, all of the p-values are essentially equal to 1 ...