Getting p-values from leave-one-out in R

2019-07-13 01:56发布

问题:

I have a data frame of 96 observations (patients) and 1098 variables (genes). The response is binary (Y and N) and the predictors are numeric. I am trying to perform leave-one-out cross validation, but my interest is not standard error, but the p-values for each variable from each of the 95 logistic regression models created from LOOCV. These are my attempts thus far:

#Data frame 96 observations 1098 variables
DF2

fit <- list()

for (i in 1:96){
  df <- DF2[-i,]
 fit[[i]] <- glm (response ~., data= df, family= "binomial")
 }
 model_pvalues <- data.frame(model = character(), p_value = numeric())

This outputs fit as a large list with 16 elements and a list of 30: $coefficients, $residuals, $fitted.values....

Attempt 1:

for (i in length(fit)){ 
  model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]])))
}

This output into "model_pvalues" 95 observations (Intercept and 94 variables) and 4 variables: Estimate, Std. Error, z value, Pr(>|z|). However what I am really trying to get is the p-value for all 1097 variables, for 95 models constructed by leave one out cross validation.

Attempt 2:

for (i in length(fit)){ 
  model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]]))[4])
}

When I ran this I get out one number (not sure from where, assuming a beta) for one variable.

Attempt 3:

for (i in 1:96){
  df <- DF2[-i,]
  fit[[i]] <- glm (response ~., data= df, family= "binomial")
  model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]])))
}

When I run this I get out a data frame of 1520 observations of 4 variables: Estimate, Std. Error, z value, Pr(>|z|). The observations begin with (Intercept) followed by 82 variables. After that it repeats this pattern with (Intercept1) and the same 82 variables, up until (Intercept15).

So my end goal is to create 95 models via LOOCV and to get the p-values for all 1097 variables used in all models. Any help would be very much appreciated!

Edit: example data (real DF 96 observations for 1098 variables)

  Response  X1  X2  X3  X4  X5  X6  X7  X8  X9  X10

P1  N       1   1   1   0   1   0   1   0   2    2
P2  N       2   1   1   0   2   2   1   2   2    2
P3  N       2   1   2   1   1   0   1   1   0    1
P4  Y       1   1   2   0   1   0   0   1   1    1
P5  N       2   2   1   1   1   0   0   0   1    1
P6  N       2   1   2   1   1   0   0   0   2    1
P7  Y       2   1   1   0   2   0   0   0   2    0
P8  Y       2   1   1   0   2   0   0   1   0    2
P9  N       1   1   1   0   2   0   0   0   1    0
P10 N       2   1   2   1   1   0   1   0   0    2

回答1:

For n observations (96 for your real data, 10 in the example data) and p variables (1098 for your real data, 10 in the example data), the code below should extract a p row by n column matrix of p-values. I feel obliged to warn you that trying to fit an n<<p case (very few observations relative to the number of parameters) is likely to have extremely poor statistical properties, and maybe even be impossible, unless you use a technique like penalized regression ... this is also probably the reason why so many of your parameters are missing from the estimates (i.e. you're only getting 94 out of a possible 1097 variables) - especially since your expression patterns are simple (only 0, 1, or 2), a large number of the parameters are collinear and can't be jointly estimated (you should have seen a lot of NAs in your original model fit, too).

Get example data:

DF2 <- read.table(row.names=1,header=TRUE,text="
Resp. X1  X2  X3  X4  X5  X6  X7  X8  X9  X10
P1  N   1   1   1   0   1   0   1   0   2   2
P2  N   2   1   1   0   2   2   1   2   2   2
P3  N   2   1   2   1   1   0   1   1   0   1
P4  Y   1   1   2   0   1   0   0   1   1   1
P5  N   2   2   1   1   1   0   0   0   1   1
P6  N   2   1   2   1   1   0   0   0   2   1
P7  Y   2   1   1   0   2   0   0   0   2   0
P8  Y   2   1   1   0   2   0   0   1   0   2
P9  N   1   1   1   0   2   0   0   0   1   0
P10 N   2   1   2   1   1   0   1   0   0   2")

Fit models

n <- nrow(DF2)
fit <- vector(mode="list",n) ## best to pre-allocate objects
for (i in 1:n) {
  df <- DF2[-i,]
  fit[[i]] <- glm (Resp. ~., data= df, family= "binomial")
}

In this case we have to be a little bit careful extracting the p-values because, due to collinearity, some of them are missing - R leaves an NA in the coefficient vector (coef()) for non-estimated parameters, but doesn't similarly fill in rows of the coefficient table in the summary.

tmpf <- function(x) {
    ## extract coef vector - has NA values for collinear terms
    ## [-1] is to drop the intercept
    r1 <- coef(x)[-1]
    ## fill in values from p-value vector; leave out intercept with -1,
    r2 <- coef(summary(x))[-1,"Pr(>|z|)"]
    r1[names(r2)] <- r2
    return(r1)
}
pvals <- sapply(fit,tmpf)

Of course, for the toy example, all of the p-values are essentially equal to 1 ...

## round(pvals,4)
##       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
## X1  0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9999 0.9998 0.9999 0.9998
## X2  0.9999 0.9999 0.9999 0.9999     NA 0.9999 0.9999 0.9999 0.9999 0.9999
## X3  0.9999 0.9999 0.9999 0.9999 0.9999 0.9998 0.9999 0.9999 0.9999 0.9999
## X4  0.9998 0.9998 0.9998     NA 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
## X5      NA 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000     NA 1.0000
## X6  0.9999     NA 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
## X7  1.0000 1.0000 1.0000 1.0000 1.0000     NA 1.0000 1.0000 1.0000 1.0000
## X8  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
## X9  1.0000 1.0000     NA 1.0000 1.0000 1.0000     NA     NA 1.0000     NA
## X10     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA