I would like to apply dplyr::summarise
and dplyr::summarise_each
at the same time for a grouped data frame. Is it possible?
My data looks like this:
mydf <- data.frame(
id = c(rep(1,2), rep(2, 3), rep(3, 4)),
amount = c(rep(1,4), rep(2,5)),
type1 = c(rep(1, 2), rep(0, 7)),
type2 = c(rep(0, 4), rep(1, 5))
)
mydf
# id amount type1 type2
#1 1 1 1 0
#2 1 1 1 0
#3 2 1 0 0
#4 2 1 0 0
#5 2 2 0 1
#6 3 2 0 1
#7 3 2 0 1
#8 3 2 0 1
#9 3 2 0 1
I would like to sum over id
the amount
variable and get the max for the type
variables. I know I can do this as follows:
mydf %>%
group_by(id) %>%
summarise(amount = sum(amount), type1 = max(type1), type2 = max(type2))
However, I have a lot of type
variables so I would prefer something like this (but with the sum of amount
as well).
mydf %>%
group_by(id) %>%
summarise_each(funs(max), matches("type"))
Using dplyr
library(dplyr)
mydf %>%
group_by(id) %>%
mutate(amount = sum(amount)) %>%
mutate_each(funs(max), matches("type")) %>%
unique
#Source: local data table [3 x 4]
# id amount type1 type2
#1 1 2 1 0
#2 2 4 0 1
#3 3 8 0 1
Or simply as @HongOoi indicated
mydf %>%
group_by(id) %>%
mutate(amount=sum(amount)) %>%
summarise_each(funs(max))
I'm not sure regarding the idiomatic way using dplyr
, but this is pretty idiomatic using data.table
library(data.table)
setDT(mydf)[, c(amount = sum(amount),
lapply(.SD[, grep("type", names(mydf), value = TRUE), with = FALSE], max)),
by = id]
# id amount type1 type2
# 1: 1 2 1 0
# 2: 2 4 0 1
# 3: 3 8 0 1
Basically, we are combining both operation using c
, while lapply(.SD, max)
stands for mutate_each
in dplyr
and matches
is just a wrapper for grep
(as clearly shown in the source code). with = FALSE
is for standard evaluation of column names within a data.table
or .SD
parent frame (which stands for SubData).
A more general approach with dplyr
could be:
mydf %>%
group_by(id) %>%
mutate_each('sum', amount) %>%
mutate_each('max', matches("type")) %>%
summarise_each('first', amount, matches("type"))
This has the benefit of applying only one aggregate function to each column that Veerendra Gadekar's original answer had. It comes handy if we need sd
or similar in place of max
, Hong Ooi's solution would break in such case. It would also break if there are character columns. Third advantage is that it drops the columns that are not part of the computation.
See also my related question.