How can I get mean values by group?

Posted 2020-05-02 14:24

Question:

I want to get the mean of var1 and var2 by group, where the groups are given by low and high. How can I compute the mean of each of the two variables for each group?

 ID     var1       var2      low     high 
 1        1          6        0        1
 2        2          7        0        1
 3        3          8        1        0
 4        4          9        1        0
 5        5         10        0        1


Answer 1:

aggregate does what you need, given the proper input.

To get the aggregate of multiple columns, you can cbind them so that they are separate columns in the result:

aggregate(cbind(var1, var2) ~ low+high, data=x, FUN=mean)
##   low high     var1     var2
## 1   1    0 3.500000 8.500000
## 2   0    1 2.666667 7.666667

If you want to take the mean of every column other than low and high, . is handy, meaning "all other columns":

aggregate(. ~ low+high, data=x, FUN=mean)
##   low high       ID     var1     var2
## 1   1    0 3.500000 3.500000 8.500000
## 2   0    1 2.666667 2.666667 7.666667

Note that + has a special meaning in a formula when it appears on the right-hand side of the ~. There it does not add the variables; it tells aggregate to group by both of them. On the left-hand side, + means ordinary addition.
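To illustrate the difference between the two sides of the ~, here is a minimal sketch. It assumes x is the data frame from the question (rebuilt here so the snippet runs on its own) and stores the result in res, a name chosen only for this example:

```r
# Assumption: x holds the question's sample data; rebuilt here for a self-contained run
x <- data.frame(ID = 1:5, var1 = 1:5, var2 = 6:10,
                low = c(0, 0, 1, 1, 0), high = c(1, 1, 0, 0, 1))

# Left of "~": "+" is arithmetic, so var1 + var2 is summed row by row first.
# Right of "~": "+" lists the grouping variables (low and high).
res <- aggregate(var1 + var2 ~ low + high, data = x, FUN = mean)
print(res)
```

The result has one row per (low, high) combination that occurs in the data, with the third column holding the group mean of var1 + var2.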



Answer 2:

A dplyr solution:

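# Rebuild the example data from the question
ID <- 1:5
var1 <- 1:5
var2 <- 6:10
low <- c(0, 0, 1, 1, 0)
high <- c(1, 1, 0, 0, 1)
mydf <- data.frame(ID, var1, var2, low, high)

library(dplyr)
# Mean of each variable within each (low, high) group
mydf %>%
  group_by(low, high) %>%
  summarise(mean_var1 = mean(var1), mean_var2 = mean(var2))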

which gives you:

  low high mean_var1 mean_var2
1   0    1  2.666667  7.666667
2   1    0  3.500000  8.500000

As Richard Scriven points out, you might instead want the mean of the sum of var1 and var2, in which case:

library(dplyr)
mydf %>%
  mutate(sum_vars = var1 + var2) %>%
  group_by(low, high) %>%
  summarise(mean_sumvars = mean(sum_vars))


  low high mean_sumvars
1   0    1     10.33333
2   1    0     12.00000


Answer 3:

Here is an option using data.table, where df1 is the data frame from the question:

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(low, high), .SDcols = var1:var2]
#   low high     var1     var2
#1:   0    1 2.666667 7.666667
#2:   1    0 3.500000 8.500000

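And for the second case (the mean of var1 + var2):

# Adding the group means gives the same value as the mean of the sum when there are no NAs
setDT(df1)[, .(sumvars = Reduce(`+`, lapply(.SD, mean))), by = .(low, high), .SDcols = var1:var2]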
#   low high  sumvars
#1:   0    1 10.33333
#2:   1    0 12.00000


Answer 4:

For individual variables, tapply is also very convenient, especially when there are several grouping variables:

> with(dat, tapply(var1, list(low, high), mean))
    0        1
0  NA 2.666667
1 3.5       NA

> with(dat, tapply(var2, list(low, high), mean))
    0        1
0  NA 7.666667
1 8.5       NA

> with(dat, tapply(var1 + var2, list(low, high), mean))
   0        1
0 NA 10.33333
1 12       NA
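If you want both variables in one call, the tapply approach can be wrapped in sapply. This is only a sketch under two assumptions: dat is the data frame from the question (rebuilt here), and grouping by low alone is enough because in the sample data high is always 1 - low. The name res is chosen for this example.

```r
# Assumption: dat holds the question's sample data; rebuilt here for a self-contained run
dat <- data.frame(ID = 1:5, var1 = 1:5, var2 = 6:10,
                  low = c(0, 0, 1, 1, 0), high = c(1, 1, 0, 0, 1))

# Apply tapply to each variable in turn; sapply binds the per-group means
# into a matrix with one row per level of low and one column per variable
res <- sapply(dat[c("var1", "var2")], function(v) tapply(v, dat$low, mean))
print(res)
```

Grouping by a single factor also avoids the NA cells that appear above for (low, high) combinations that never occur in the data.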