fit model to multiple groupings or subsets and ext

2019-07-31 23:29发布

I want to fit models and pull out specific parameters split by grouping factors (fac1 and fac2 below) or subsets. My problem is that when sapply outputs the correct parameters, I'm stuck with a list where the elements are named as combinations. What I want to get is a data.frame where I have a column for each factor with the appropriate label. I want to do this in base R.

Notice, the answer needs to be general and not for the specific names used in this case. The answer shouldn't be hindered if factor names include 'periods.' I'm eventually making something to use with any data, so this answer needs to do so, and also with any number of factors. I am actually using a custom function on a much larger data set but this example represents my issue.

Following is reproducible code:

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 
#run regression by these 4 groups, pull out slope

The output is:

A.X.c$x  B.X.c$x  A.Y.c$x  B.Y.c$x 
1.861290 2.131431 1.590733 1.746169

What I want is:

fac1 fac2 slope
A    X    1.861290 
B    X    2.131431 
A    Y    1.590733 
B    Y    1.746169

The following code might be made to be more general to accomplish this, but I'm worried about cases where expand.grid makes all possible combinations but the user has missing combinations in their data, and also whether the order will stay the same. Does expand.grid use a similar method as however split subsets the data that determines the order of the returned values?

slopes <- sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 

dataframeplz <- as.data.frame(expand.grid(unique(fac1), unique(fac2))) 

dataframeplz$slope <- slopes

dataframeplz

Here is the plyr solution if that helps. It's so easy but not base R. Anyone know where in Hadley's code this magic happens? Githubbers?

library("plyr")
neatdata <- data.frame(fac1,fac2,x,y)
ddply(neatdata, c("fac1", "fac2"), function(c) coef(lm(c$y~c$x))[2])

标签: r apply
3条回答
走好不送
2楼-- · 2019-07-31 23:53

A. Webb's answer is more elegant, but this lapply/arbitrary function/do.call/rbind workflow has been my last resort for this kind of thing for years:

# Move the factors inside your data frame, so they'll be available after the split()
xy <- data.frame(x, y, fac1, fac2)

# Iterate over the split
reglist <- lapply(split(xy, factors), FUN = function(group) {

    # Get the current factor levels
    group_levels <- unique(group[c("fac1", "fac2")])

    # Mash it all into a data.frame
    data.frame(group_levels, slope = coef(lm(y ~ x, data = group))[2])

})

# Collapse the list into a data.frame
do.call("rbind", reglist)
查看更多
仙女界的扛把子
3楼-- · 2019-08-01 00:01

For base R, aggregate is the user friendly function for such situations.

aggregate(cbind(slope=1:nrow(xy))~fac1+fac2,FUN=function(r) coef(lm(y~x,data=xy[r,]))[2])
  fac1 fac2    slope
1    A    X 1.861290
2    B    X 2.131431
3    A    Y 1.590733
4    B    Y 1.746169

This could also be done with by in a fashion a bit more similar to your original.

setNames(as.data.frame.table(
  by(xy,list(fac1,fac2),FUN=function(c) coef(lm(c$y~c$x))[2])),
  c("fac1","fac2","slope"))
查看更多
疯言疯语
4楼-- · 2019-08-01 00:07

I used base R and I focused on your specific example. This process has limitations as it handles column names as strings and keeps the appropriate info you need.

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

dt_res = sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) #run regression by these 4 groups, pull out slope

dt_res

# A.X.c$x  B.X.c$x  A.Y.c$x  B.Y.c$x 
# 1.861290 2.131431 1.590733 1.746169


dt_res = data.frame(dt_res)
dt_res = data.frame(names=rownames(dt_res),   # save the names as a column
                    slope=dt_res$dt_res,
                    row.names = NULL)

dt_res$names = gsub(".c[$]x","",dt_res$names)  # delete the unecessary characters from names
dt_res$fac1 = substr(dt_res$names,1,1)       # pick first character
dt_res$fac2 = substr(dt_res$names,3,3)       # pick 3rd character
dt_res[,c("fac1","fac2","slope")]

#    fac1 fac2    slope
# 1    A    X 1.861290
# 2    B    X 2.131431
# 3    A    Y 1.590733
# 4    B    Y 1.746169

I've updated it to something more general:

  #create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

res = sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) #run regression by these 4 groups, pull out slope

# split names by . (that will be the split symbol always)
    names = strsplit(names(split(xy, factors)), split ="[.]")

# create empty vectors to store names
fac1 = vector()
fac2 = vector()

for (i in 1:length(names)){

# iterate through the list of names and keep values from the corresponding position
  fac1 = c(fac1, names[[i]][1])
  fac2 = c(fac2, names[[i]][2])
}


data.frame(fac1,
           fac2,
           slope = res,
           row.names = NULL)
查看更多
登录 后发表回答