I want to fit models and pull out specific parameters, split by grouping factors (fac1 and fac2 below). My problem is that although sapply returns the correct parameters, I'm stuck with a named vector whose names are the factor combinations pasted together. What I want is a data.frame with one column per factor, holding the appropriate label. I want to do this in base R.
Note that the answer needs to be general, not tied to the specific names used here, and it shouldn't break if factor labels include periods. I'm eventually building something to use with any data, so it needs to work with any number of factors. I am actually using a custom function on a much larger data set, but this example represents my issue.
Here is a reproducible example:
#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)
xy <- data.frame(x,y) #bind parameters for regression
factors <- list(fac1, fac2) #split by 2 factors
#run regression by these 4 groups, pull out slope
sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2])
The output is:
A.X.c$x B.X.c$x A.Y.c$x B.Y.c$x
1.861290 2.131431 1.590733 1.746169
What I want is:
fac1 fac2 slope
A X 1.861290
B X 2.131431
A Y 1.590733
B Y 1.746169
The following code might be generalized to accomplish this, but I'm worried about two things: expand.grid generates every possible combination even when some combinations are missing from the user's data, and I don't know whether the order will stay the same. Does expand.grid order its rows the same way split orders the groups it returns?
slopes <- sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2])
dataframeplz <- as.data.frame(expand.grid(unique(fac1), unique(fac2)))
dataframeplz$slope <- slopes
dataframeplz
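The ordering concern can be checked on a small case. The sketch below uses toy vectors f1 and f2 (hypothetical, not the data above) that are unsorted and have one combination missing; it compares the names that split produces with the rows of expand.grid. It is only an illustration of the behavior, not a recommended solution.

```r
# Toy grouping vectors (hypothetical): unsorted, and the B/X combination never occurs.
f1 <- c("B", "A", "A", "B")
f2 <- c("Y", "X", "Y", "Y")

groups <- split(seq_along(f1), list(f1, f2))

# split() builds its groups with interaction(): sorted levels,
# with the first factor varying fastest.
names(groups)
lengths(groups)   # the empty B.X group is kept unless drop = TRUE

# expand.grid() also varies its first argument fastest, so it lines up
# with split() when it is given the sorted levels ...
by_levels <- expand.grid(fac1 = sort(unique(f1)), fac2 = sort(unique(f2)))
identical(names(groups), paste(by_levels$fac1, by_levels$fac2, sep = "."))

# ... but unique() alone returns values in order of appearance,
# which need not match the order split() uses.
by_unique <- expand.grid(fac1 = unique(f1), fac2 = unique(f2))
identical(names(groups), paste(by_unique$fac1, by_unique$fac2, sep = "."))
```

In the example data above, unique(fac1) and unique(fac2) happen to be sorted already and every combination occurs, which is why the expand.grid version appears to work there. With a missing combination, split keeps an empty group by default, and fitting lm on it would fail.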
Here is the plyr solution, if that helps. It's so easy, but it's not base R. Does anyone know where in Hadley's code this magic happens? Githubbers?
library("plyr")
neatdata <- data.frame(fac1,fac2,x,y)
ddply(neatdata, c("fac1", "fac2"), function(c) coef(lm(c$y~c$x))[2])
A. Webb's answer is more elegant, but this lapply / arbitrary function / do.call / rbind workflow has been my last resort for this kind of thing for years:
For base R, aggregate is the user-friendly function for such situations.
This could also be done with by, in a fashion a bit more similar to your original.
I used base R and focused on your specific example. This process has limitations, as it handles column names as strings, but it keeps the appropriate info you need.
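The code for this approach is not shown here, so the following is only a minimal sketch of how a string-based, base R version might look for the specific example. The names dat, group_cols, pieces, rows and result are my own assumptions. The labels are read from the first row of each subset rather than parsed back out of the pasted group names.

```r
# Recreate the question's data as a single data frame.
set.seed(1337)
dat <- data.frame(
  fac1 = rep(c("A", "B"), each = 10),
  fac2 = rep(rep(c("X", "Y"), each = 5), times = 2),
  x    = rep(1:5, times = 4)
)
dat$y <- rep(seq(2, 10, 2), 4) * runif(20, 0.8, 1.2)

# Grouping columns, given as strings (assumed names from the question).
group_cols <- c("fac1", "fac2")

# One data frame per combination that actually occurs in the data.
pieces <- split(dat, dat[group_cols], drop = TRUE)

# For each piece, keep its labels and attach the fitted slope.
rows <- lapply(pieces, function(d) {
  data.frame(d[1, group_cols, drop = FALSE],
             slope = unname(coef(lm(y ~ x, data = d))["x"]))
})

# Stack the one-row data frames and discard the pasted row names.
result <- do.call(rbind, rows)
rownames(result) <- NULL
result
```

Because each label comes from the subset itself, the output does not depend on how split pastes the group names together.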
I've updated it to something more general:
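The updated code is not shown here either. As a hedged sketch of what a more general base R version could look like, the hypothetical helper group_apply below (its name and arguments are assumptions, not an existing function) takes any number of grouping columns and any function that returns a single value. It groups on integer codes instead of pasted labels, so periods in the labels should not matter, and combinations that never occur are dropped.

```r
# Hypothetical helper: apply `fun` to each group of `data` defined by the
# columns named in `by`; returns one row per combination that occurs.
group_apply <- function(data, by, fun, value_name = "value") {
  # Integer codes per grouping column, so the pasted key is unambiguous
  # even when the labels themselves contain periods.
  codes <- lapply(data[by], function(col) as.integer(factor(col)))
  key <- interaction(codes, drop = TRUE)   # drop = TRUE removes unused combinations
  pieces <- split(data, key)

  rows <- lapply(pieces, function(d) {
    val <- fun(d)
    stopifnot(length(val) == 1)            # this sketch expects one value per group
    out <- d[1, by, drop = FALSE]          # labels come from the data itself
    out[[value_name]] <- unname(val)
    out
  })

  result <- do.call(rbind, rows)
  rownames(result) <- NULL
  result
}

# Usage on the question's data, rebuilt here so the block runs on its own.
set.seed(1337)
dat <- data.frame(
  fac1 = rep(c("A", "B"), each = 10),
  fac2 = rep(rep(c("X", "Y"), each = 5), times = 2),
  x    = rep(1:5, times = 4)
)
dat$y <- rep(seq(2, 10, 2), 4) * runif(20, 0.8, 1.2)

slopes <- group_apply(dat, c("fac1", "fac2"),
                      function(d) coef(lm(y ~ x, data = d))[["x"]],
                      value_name = "slope")
slopes
```

Rows with NA in a grouping column are left out by split, so they would need separate handling if they matter for your data.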