Sum multiple variables by group and create new col

2020-03-31 04:46发布

问题:

I have a data frame with grouped variable and I want to sum them by group. It's easy with dplyr.

library(dplyr)
library(magrittr)

data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)

data %>% group_by(group) %>%
  summarise_all(sum)

# A tibble: 3 x 3
   group    n1    n2
  <fctr> <int> <int>
1      a     3     5
2      b     3     4
3      c     9    11

But now I want a new column total with the sum of n1 and n2 by group. Like this:

# A tibble: 3 x 3
   group    n1    n2   ttl
  <fctr> <int> <int> <int>
1      a     3     5     8
2      b     3     4     7
3      c     9    11    20

How can I do that with dplyr?

EDIT: Actually, it's just an example, I have a lot of variables.

I tried these two codes but it's not in the right dimension...

data %>% group_by(group) %>%
  summarise_all(sum) %>%
  summarise_if(is.numeric, sum)

data %>% group_by(group) %>%
  summarise_all(sum) %>%
  mutate_if(is.numeric, .funs = sum)

回答1:

You can use mutate after summarize:

data %>% 
    group_by(group) %>%
    summarise_all(sum) %>% 
    mutate(tt1 = n1 + n2)

# A tibble: 3 x 4
#   group    n1    n2   tt1
#  <fctr> <int> <int> <int>
#1      a     3     5     8
#2      b     3     4     7
#3      c     9    11    20

If need to sum all numeric columns, you can use rowSums with select_if (to select numeric columns) to sum columns up:

data %>% 
    group_by(group) %>%
    summarise_all(sum) %>% 
    mutate(tt1 = rowSums(select_if(., is.numeric)))

# A tibble: 3 x 4
#   group    n1    n2   tt1
#  <fctr> <int> <int> <dbl>
#1      a     3     5     8
#2      b     3     4     7
#3      c     9    11    20


回答2:

We can use apply together with the dplyr functions.

data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)

data %>% group_by(group) %>%
  summarise_all(sum) %>%
  mutate(ttl = apply(.[, 2:ncol(.)], 1, sum))

# A tibble: 3 × 4
   group    n1    n2   ttl
  <fctr> <int> <int> <int>
1      a     3     5     8
2      b     3     4     7
3      c     9    11    20

Or rowSums with the same strategy. The key is to use . to specify the data frame and [] with x:ncol(.) to keep the columns you want.

data %>% group_by(group) %>%
  summarise_all(sum) %>%
  mutate(ttl = rowSums(.[, 2:ncol(.)]))

# A tibble: 3 × 4
   group    n1    n2   ttl
  <fctr> <int> <int> <dbl>
1      a     3     5     8
2      b     3     4     7
3      c     9    11    20


回答3:

Base R

cbind(aggregate(.~group, data, sum), ttl = sapply(split(data[,-1], data$group), sum))
#  group n1 n2 ttl
#a     a  3  5   8
#b     b  3  4   7
#c     c  9 11  20


回答4:

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'group', get the sum of each columns in the Subset of data.table, and then with Reduce, get the sum of the rows of the columns of interest

library(data.table)
setDT(data)[, lapply(.SD, sum) , group][, tt1 := Reduce(`+`, .SD),  
                       .SDcols = names(data)[-1]][]
#   group n1 n2 tt1
#1:     a  3  5   8
#2:     b  3  4   7
#3:     c  9 11  20

Or with base R

addmargins(as.matrix(rowsum(data[-1], data$group)), 2)
#  n1 n2 Sum
#a  3  5   8
#b  3  4   7
#c  9 11  20

Or with dplyr

data %>%
   group_by(group) %>%
   summarise_all(sum) %>% 
   mutate(tt = rowSums(.[-1]))