Regression of a Data Frame with multiple factor gr

Posted 2019-09-05 14:57

Question:

I am working on a regression script. I have a data.frame with roughly 130 columns, and I need to run a regression of one column (let's call it the X column) against each of the other ~100 numeric columns.

Before the regression is calculated, I need to group the data by four factors: myDat$Recipe, myDat$Step, myDat$Stage, and myDat$Prod, while still keeping the other ~100 columns and their row data attached for the regression. Then I need to fit a regression of each column ~ X column and print the R^2 value together with the column name. This is what I've tried so far, but it is getting overly complicated and I know there has to be a better way.

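rm(list = ls())
myDat <- read.csv(file = "C:/Users/Documents/myDat.csv", header = TRUE, sep = ",")

# Response columns: every numeric column except X and the grouping factors
groupCols <- c("Recipe", "Step", "Stage", "Prod")
numCols <- setdiff(names(myDat)[sapply(myDat, is.numeric)], c("X", groupCols))

# Loop over the unique levels of each factor, narrowing the subset at each level
for (j in unique(myDat$Recipe)) {
  myDatj <- subset(myDat, Recipe == j)
  for (k in unique(myDatj$Step)) {
    myDatk <- subset(myDatj, Step == k)
    for (i in unique(myDatk$Stage)) {
      myDati <- subset(myDatk, Stage == i)
      for (m in unique(myDati$Prod)) {
        myDatm <- subset(myDati, Prod == m)
        for (col in numCols) {
          # Regress this column on X within the current group
          fit <- lm(myDatm[[col]] ~ X, data = myDatm)
          rsq <- summary(fit)$r.squared
          writeLines(paste(rsq, col))
        }
      }
    }
  }
}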

Answer 1:

You can do this by combining dplyr, tidyr and my broom package (you can install them with install.packages). First you need to gather all the numeric columns into a single column:

library(dplyr)
library(tidyr)
tidied <- myDat %>%
    gather(column, value, -X, -Recipe, -Step, -Stage, -Prod)

To understand what this does, you can read up on tidyr's gather operation. (This assumes that all columns besides X, Recipe, Step, Stage, and Prod are numeric and should therefore be used as responses in your regressions. If that's not the case, you need to remove the non-numeric columns beforehand. You'll need to produce a reproducible example of the problem if you need a more customized solution.)
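As a rough illustration of the gather step, here is a minimal sketch on a made-up data frame. The `wide` data, its `Notes` column, and the `group_cols` helper are hypothetical and only stand in for myDat; the sketch also shows one possible way to drop non-numeric columns before gathering.

```r
library(tidyr)

# Hypothetical toy data standing in for myDat (not the asker's real data)
wide <- data.frame(
    Recipe = c("A", "B"),
    X      = c(1.0, 2.0),
    col1   = c(10, 20),
    col2   = c(30, 40),
    Notes  = c("ok", "check"),   # a non-numeric column that should not be regressed
    stringsAsFactors = FALSE
)

# Keep the grouping column plus every numeric column; this drops Notes
group_cols <- c("Recipe")
keep <- sapply(wide, is.numeric) | names(wide) %in% group_cols
wide_num <- wide[, keep]

# Stack col1 and col2 into a key column ("column") and a value column ("value")
long <- gather(wide_num, column, value, -X, -Recipe)
long
```

Each original row now appears once per gathered column, so `long` has one row for every (row, column) pair, with X and Recipe repeated alongside.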

Then perform each regression, while grouping by the column and the four grouping variables.

library(broom)

regressions <- tidied %>%
    group_by(column, Recipe, Step, Stage, Prod) %>%
    do(mod = lm(value ~ X, data = .))

glances <- regressions %>% glance(mod)

The resulting glances data frame will have one row for each combination of column, Recipe, Step, Stage, and Prod, along with an r.squared column containing the R-squared from each model. (It will also contain adj.r.squared, along with other columns such as the F-test p-value.) Running coefs <- regressions %>% tidy(mod) will probably also be useful for you, as it will get the coefficient estimates and p-values from each regression.
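As far as I know, newer broom releases no longer accept glance(mod) or tidy(mod) on the rowwise data frame that do() returns, so the following is a hedged sketch of an alternative that calls glance() and tidy() inside do() instead. The `toy` data frame and the `*_toy` names are invented for illustration and are not part of the original question.

```r
library(dplyr)
library(tidyr)
library(broom)

# Hypothetical toy data: two Recipe groups, X, and two numeric response columns
set.seed(1)
toy <- data.frame(
    Recipe = rep(c("A", "B"), each = 10),
    Step   = "s1",
    Stage  = "g1",
    Prod   = "p1",
    X      = rep(1:10, 2),
    stringsAsFactors = FALSE
)
toy$col1 <- 2 * toy$X + rnorm(20)   # strongly related to X
toy$col2 <- rnorm(20)               # unrelated to X

tidied_toy <- toy %>%
    gather(column, value, -X, -Recipe, -Step, -Stage, -Prod)

# One row of model summaries (including r.squared) per column and group
glances_toy <- tidied_toy %>%
    group_by(column, Recipe, Step, Stage, Prod) %>%
    do(glance(lm(value ~ X, data = .))) %>%
    ungroup()

# One row per coefficient (intercept and slope) per column and group
coefs_toy <- tidied_toy %>%
    group_by(column, Recipe, Step, Stage, Prod) %>%
    do(tidy(lm(value ~ X, data = .))) %>%
    ungroup()

glances_toy %>% select(column, Recipe, r.squared)
```

Returning a data frame directly from do() keeps the grouping columns attached to each result, so the R-squared values can be read off next to the column name without a separate extraction step.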

A similar use case is described in the "broom and dplyr" vignette, and in Section 3.1 of the broom manuscript.