This question already has an answer here:
-
Linear Regression and group by in R
10 answers
I am new to R and am trying to run a linear regression on multiple subsets ("Cases") of data in a single file. I have 50 different cases, so I don't want to have to run 50 different regressions...be nice to automate this. I have found and experimented with the ddply
method, but this, for some reason, returns the same coefficients to me for each case. Code I'm using is as follows:
ddply(MyData, "Case", function(x) coefficients(lm(Y~X1+X2+X3, MyData)))
Results I get, again, are the same coefficients for each "Case". Any ideas on how I can improve my code so that the regression runs once for each case and gives me unique coefficients for each case?
ddply
passes data.frames (from splitting the input data.frame) to the function. You probably want this:
ddply(MyData, "Case", function(df) coefficients(lm(Y~X1+X2+X3, data=df)))
(Not tested since you don't provide a reproducible example.)
You passed the whole input data.frame to lm
for each group..
You can also do it using only base functions:
# load data
data(warpbreaks)
# fit linear model to each subset
fits <- by(warpbreaks, warpbreaks[,"tension"],
function(x) lm(breaks ~ wool, data = x))
# Combine coefficients from each model
do.call("rbind", lapply(fits, coef))
Though you can get pretty far with (d)plyr, I get the most flexibility with a simple for loop (as I’m not too confident with the do() when it comes to wanting more than just the output of coefficients(), for example)
Let’s start by loading the data and loading some packages:
library(dplyr)
library(broom) # get lm output (coefficients) as a dataframe
library(reshape2) # Melt / DCAST for swapping between wide / long format
data(warpbreaks)
Next, create a list of groups we’re going to be iterating over.
GROUPS <- unique(warpbreaks$tension) # Specify groups, one unique model per group.
The code to answer your question, in this case, would be something of the sorts:
for (i in 1:length(GROUPS)){
CURRENT_GROUP <- GROUPS[i]
df <- filter(warpbreaks, tension==CURRENT_GROUP) # subset the dataframe
fit <- lm(breaks ~ wool, data = df) # Build a model
coeff <- tidy(fit) # Get a pretty data frame of the coefficients & p values
coeff <- coeff[,c(1,2,5)] # Extract P.Value & Estimate
# Rename (intercept) to INT for pretty column names
coeff[coeff$term=="(Intercept)", ]$term <- "INT"
# Make it into wide format with reshape2 package.
coeff <- coeff %>% melt(id.vars=c("term"))
# Defactor the resulting data.frame
coeff <- mutate(coeff,
variable=as.character(variable),
term=as.character(term))
# Rename for prettier column names later
coeff[coeff$variable=="estimate", ]$variable <- "Beta"
coeff[coeff$variable=="p.value", ]$variable <- "P"
coeff <- dcast(coeff, .~term+variable)[,-1]
rsquared <- summary(fit)$r.squared
# Create a df out of what we just did.
row <- cbind(
data.frame(
group=CURRENT_GROUP,
rsquared=rsquared),
coeff
)
# If first iteration, create data.frame -- otherwise: rowbind
if (i==1){
RESULT_ROW = row
}
else{
RESULT_ROW = rbind(RESULT_ROW, row)
} # End if.
} # End for loop
When writing the inner for loop code, just test something out by simulating the loop (first run i <- 1, run the inner loop code step by step, then run i <- 2 etc.).
The resulting dataframe:
RESULT_ROW
## group rsquared INT_Beta INT_P woolB_Beta woolB_P
## 1 L 0.26107601 44.55556 9.010046e-08 -16.333333 0.03023435
## 2 M 0.07263228 24.00000 5.991177e-07 4.777778 0.27947884
## 3 H 0.12666292 24.55556 9.234773e-08 -5.777778 0.14719571
I’m not saying this is better than the answers described above, I just like the fact that this gives me the flexibility to extract anything from any model (not just lm models that have a pretty coefficients() function).