dplyr summarise when function return is vector-valued

Posted 2020-06-07 06:28

Question:

The dplyr::summarize() function can apply arbitrary functions over the data, but it seems that the function must return a scalar value. I'm curious whether there is a reasonable way to handle functions that return a vector value without making multiple calls to the function.

Here's a somewhat silly minimal example. Consider a function that gives multiple values, such as:

f <- function(x,y){
  coef(lm(x ~ y, data.frame(x=x,y=y)))
}

and data that looks like:

df <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'), x=rnorm(12,1,1), y=rnorm(12,1,1))

I'd like to do something like:

library(dplyr)

df %>%
  group_by(group) %>%
  summarise(f(x,y))

and get back a table that has 2 columns added for each of the returned values instead of the usual 1 column. Instead, this errors with: Expecting single value

Of course we can get multiple values from dplyr::summarise() by calling a separate function for each value:

f1 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[1]]
f2 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[2]]

df %>% 
group_by(group) %>%
summarise(a = f1(x,y), b = f2(x,y))

This gives the desired output:

  group         a            b
1     A 1.7957245 -0.339992915
2     B 0.5283379 -0.004325209
3     C 1.0797647 -0.074393457

but coding in this way is ridiculously crude and ugly.

data.table handles this case more succinctly:

library(data.table)
dt <- as.data.table(df)
dt[, f(x,y), by="group"]

but it creates an output that extends the table with additional rows instead of additional columns, which is both confusing and harder to work with:

   group           V1
1:     A  1.795724536
2:     A -0.339992915
3:     B  0.528337890
4:     B -0.004325209
5:     C  1.079764710
6:     C -0.074393457

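Of course there are more classic apply strategies we could use here:

# iterate over the group labels; as.character() keeps the column names
# A, B, C whether group is stored as a factor or as a character vector
sapply(unique(as.character(df$group)), function(g) coef(lm(x ~ y, df[df$group == g, ])))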


                     A            B           C
(Intercept)  1.7957245  0.528337890  1.07976471
y           -0.3399929 -0.004325209 -0.07439346

but this sacrifices both the elegance and, I suspect, the speed of the grouping. In particular, note that we do not use our pre-defined function f in this case, but instead hard-code the grouping into the function definition.
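For comparison, here is a minimal base R sketch that keeps the pre-defined f unchanged by splitting the data frame first. The seed, the object names coefs and res, and the column names a and b are assumptions added for illustration; they are not part of the original question.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible
df <- data.frame(group = rep(c("A", "B", "C"), each = 4),
                 x = rnorm(12, 1, 1), y = rnorm(12, 1, 1))

f <- function(x, y) coef(lm(x ~ y, data.frame(x = x, y = y)))

# split() yields one data frame per group; f is applied to each piece as-is,
# giving a 2 x 3 matrix with one column per group
coefs <- sapply(split(df, df$group), function(d) f(d$x, d$y))

# transpose so each group becomes a row, rename the coefficient columns,
# and attach the group label as an ordinary column
res <- data.frame(group = colnames(coefs),
                  setNames(as.data.frame(t(coefs)), c("a", "b")),
                  row.names = NULL)
res
```

This still loops over groups in plain R, so it says nothing about speed; it only shows that the grouping does not have to be written into f itself.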

Is there a dplyr function for handling this case? If not, is there a more elegant way to handle this process of evaluating vector-valued functions over a data.frame by group?

Answer 1:

You could try do():

library(dplyr)
df %>%
  group_by(group) %>%
  do(setNames(data.frame(t(f(.$x, .$y))), letters[1:2]))
#  group         a           b
#1     A 0.8983217 -0.04108092
#2     B 0.8945354  0.44905220
#3     C 1.2244023 -1.00715248

For comparison, the output based on f1 and f2 is

df %>% 
  group_by(group) %>%
  summarise(a = f1(x,y), b = f2(x,y))
#  group         a           b
#1     A 0.8983217 -0.04108092
#2     B 0.8945354  0.44905220
#3     C 1.2244023 -1.00715248
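As a side note, newer dplyr releases (1.0.0 and later, as far as I know) let summarise() accept a data frame result and spread it into several columns, so do() may no longer be needed. The sketch below assumes such a version is installed; the seed and the column names a and b are illustrative choices, not part of the original answer.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only to make the sketch reproducible
df <- data.frame(group = rep(c("A", "B", "C"), each = 4),
                 x = rnorm(12, 1, 1), y = rnorm(12, 1, 1))

f <- function(x, y) coef(lm(x ~ y, data.frame(x = x, y = y)))

# f() is called once per group; its named vector is turned into a one-row
# data frame, and because that result is left unnamed inside summarise(),
# each of its columns becomes a column of the output
res <- df %>%
  group_by(group) %>%
  summarise(as.data.frame(t(setNames(f(x, y), c("a", "b")))))
res
```

On older dplyr versions this call is expected to fail in the same way as the original summarise(f(x,y)) attempt, so the do() approach remains the fallback there.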

Update

If you are using data.table, one option to get a similar result is

library(data.table)
setnames(setDT(df)[, as.list(f(x,y)) , group], 2:3, c('a', 'b'))[]
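The reason as.list() helps is that data.table turns a list returned in j into one column per element, whereas a plain vector is stacked into rows. A small variation, sketched here under the assumption that naming the coefficients up front is acceptable, avoids the separate setnames() step and works on a copy instead of converting df in place; the seed and the names dt and res are illustrative.

```r
library(data.table)

set.seed(1)  # assumption: seed added only to make the sketch reproducible
df <- data.frame(group = rep(c("A", "B", "C"), each = 4),
                 x = rnorm(12, 1, 1), y = rnorm(12, 1, 1))

f <- function(x, y) coef(lm(x ~ y, data.frame(x = x, y = y)))

# as.data.table() copies df, so the original data frame is left untouched
dt <- as.data.table(df)

# name the two coefficients first, then convert to a list so that each
# element becomes its own column in the grouped result
res <- dt[, as.list(setNames(f(x, y), c("a", "b"))), by = group]
res
```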


Answer 2:

This is why I still love plyr::ddply():

library(plyr)
f <- function(z) setNames(coef(lm(x ~ y, z)), c("a", "b"))
ddply(df, ~ group, f)
#   group           a          b
# 1     A   0.5213133 0.04624656
# 2     B   0.3020656 0.01450137
# 3     C   0.2189537 0.22998823


Tags: r dplyr