Use dplyr's summarise and summarise_each toget

2020-03-11 06:39发布

问题:

I would like to apply dplyr::summarise and dplyr::summarise_each at the same time for a grouped data frame. Is it possible?

My data looks like this:

mydf <- data.frame(
    id = c(rep(1,2), rep(2, 3), rep(3, 4)), 
    amount = c(rep(1,4), rep(2,5)), 
    type1 = c(rep(1, 2), rep(0, 7)),
    type2 = c(rep(0, 4), rep(1, 5))
)
mydf
#  id amount type1 type2
#1  1      1     1     0
#2  1      1     1     0
#3  2      1     0     0
#4  2      1     0     0
#5  2      2     0     1
#6  3      2     0     1
#7  3      2     0     1
#8  3      2     0     1
#9  3      2     0     1

I would like to sum over id the amount variable and get the max for the type variables. I know I can do this as follows:

mydf %>% 
    group_by(id) %>% 
    summarise(amount = sum(amount), type1 = max(type1), type2 = max(type2))

However, I have a lot of type variables so I would prefer something like this (but with the sum of amount as well).

mydf %>%
    group_by(id) %>%
    summarise_each(funs(max), matches("type"))

回答1:

Using dplyr

library(dplyr)

mydf %>% 
     group_by(id) %>% 
     mutate(amount = sum(amount)) %>% 
     mutate_each(funs(max), matches("type")) %>%
     unique

#Source: local data table [3 x 4]

#  id amount type1 type2
#1  1      2     1     0
#2  2      4     0     1
#3  3      8     0     1

Or simply as @HongOoi indicated

mydf %>% 
     group_by(id) %>% 
     mutate(amount=sum(amount)) %>% 
     summarise_each(funs(max))


回答2:

I'm not sure regarding the idiomatic way using dplyr, but this is pretty idiomatic using data.table

library(data.table)
setDT(mydf)[, c(amount = sum(amount), 
                lapply(.SD[, grep("type", names(mydf), value = TRUE), with = FALSE], max)),
            by = id]
#    id amount type1 type2
# 1:  1      2     1     0
# 2:  2      4     0     1
# 3:  3      8     0     1

Basically, we are combining both operation using c, while lapply(.SD, max) stands for mutate_each in dplyr and matches is just a wrapper for grep (as clearly shown in the source code). with = FALSE is for standard evaluation of column names within a data.table or .SD parent frame (which stands for SubData).



回答3:

A more general approach with dplyr could be:

mydf %>%
  group_by(id) %>%
  mutate_each('sum', amount) %>%
  mutate_each('max', matches("type")) %>%
  summarise_each('first', amount, matches("type"))

This has the benefit of applying only one aggregate function to each column that Veerendra Gadekar's original answer had. It comes handy if we need sd or similar in place of max, Hong Ooi's solution would break in such case. It would also break if there are character columns. Third advantage is that it drops the columns that are not part of the computation.

See also my related question.



标签: r dplyr