How to use map from purrr with dplyr::mutate to cr

2019-02-06 14:20发布

问题:

I have to following issue using R. In short I want to create multiple new columns in a data frame based on calculations of different column pairs in the data frame.

The data looks as follows:

df <- data.frame(a1 = c(1:5), 
                 b1 = c(4:8), 
                 c1 = c(10:14), 
                 a2 = c(9:13), 
                 b2 = c(3:7), 
                 c2 = c(15:19))
df
a1 b1 c1 a2 b2 c2
1  4 10  9  3 15
2  5 11 10  4 16
3  6 12 11  5 17
4  7 13 12  6 18
5  8 14 13  7 19

The output is supposed to look like the following:

a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  4 10  9  3 15    10     7    25
2  5 11 10  4 16    12     9    27
4  7 13 12  6 18    16    13    31
5  8 14 13  7 19    18    15    33

I can achieve this using dplyr doing some manual work in the following way:

df %>% rowwise %>% mutate(sum_a = sum(a1, a2),
                          sum_b = sum(b1, b2),
                          sum_c = sum(c1, c2)) %>% 
  as.data.frame()

So what is being done is: take columns with the letter "a" in it, calulate the sum rowwise, and create a new column with the sum named sum_[letter]. Repeat for columns with different letters.

This is working, however, if I have a large data set with say 300 different column pairs the manual input would be significant, since I would have to write 300 mutate calls.

I recently stumbled upon the R package "purrr" and my guess is that this would solve my problem of doing what I want in a more automated way.

In particular, I would think to be able to use purrr:map2 to which I pass two lists of column names.

  • list1 = all columns with the number 1 in it
  • list2 = all columns with the number 2 in it

Then I could calculate the sum of each matching list entry, in the form of:

map2(list1, list2, ~mutate(sum))

However, I am not able to figure out how to best approach this problem using purrr. I am rather new to using purrr, so I would really appreciate any help on this issue.

回答1:

Here is one option with purrr. We get the unique prefix of the names of the dataset ('nm1'), use map (from purrr) to loop through the unique names, select the column that matches the prefix value of 'nm1', add the rows using reduce and the bind the columns (bind_cols) with the original dataset

library(tidyverse)
nm1 <- names(df) %>% 
          substr(1, 1) %>%
          unique 
nm1 %>% 
     map(~ df %>% 
            select(matches(.x)) %>%
            reduce(`+`)) %>%
            set_names(paste0("sum_", nm1)) %>%
     bind_cols(df, .)
#    a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1  1  4 10  9  3 15    10     7    25
#2  2  5 11 10  4 16    12     9    27
#3  3  6 12 11  5 17    14    11    29
#4  4  7 13 12  6 18    16    13    31
#5  5  8 14 13  7 19    18    15    33


回答2:

In case you like to consider a base R approach, here's how you could do it:

cbind(df, lapply(split.default(df, substr(names(df), 0,1)), rowSums))
#  a1 b1 c1 a2 b2 c2  a  b  c
#1  1  4 10  9  3 15 10  7 25
#2  2  5 11 10  4 16 12  9 27
#3  3  6 12 11  5 17 14 11 29
#4  4  7 13 12  6 18 16 13 31
#5  5  8 14 13  7 19 18 15 33

It splits the data column-wise into a list, based on the first letter of each column name (either a, b, or c).

If you have a large number of columns and need to differentiate between all characters except the numbers at the end of each column name, you could modify the approach to:

cbind(df, lapply(split.default(df, sub("\\d+$", "", names(df))), rowSums))


回答3:

in base R, all vectorized:

nms <- names(df)
df[paste0("sum_",unique(gsub("[1-9]","",nms)))] <-
  df[endsWith(nms,"1")] + df[endsWith(nms,"2")]

#   a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
# 1  1  4 10  9  3 15    10     7    25
# 2  2  5 11 10  4 16    12     9    27
# 3  3  6 12 11  5 17    14    11    29
# 4  4  7 13 12  6 18    16    13    31
# 5  5  8 14 13  7 19    18    15    33


回答4:

For a hackish tidy solution, check this out:

library(tidyr)
library(dplyr)

df %>% 
   rownames_to_column(var = 'row') %>% 
   gather(a1:c2, key = 'key', value = 'value') %>% 
   extract(key, into = c('col.base', 'col.index'), regex = '([a-zA-Z]+)([0-9]+)') %>% 
   group_by(row, col.base) %>% 
   summarize(.sum = sum(value)) %>%
   spread(col.base, .sum) %>% 
   bind_cols(df, .) %>% 
   select(-row)

Basically, I collect all pairs of columns with their values across all rows, separate the column name in two parts, calculate the row sums for columns with the same letter, and cast it back to the wide form.



回答5:

Another solution that splits df by the numbers than use Reduce to calculate the sum

library(tidyverse)

df %>% 
  split.default(., substr(names(.), 2, 3)) %>% 
  Reduce('+', .) %>% 
  set_names(paste0("sum_", substr(names(.), 1, 1))) %>% 
  cbind(df, .)

#>   a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1  1  4 10  9  3 15    10     7    25
#> 2  2  5 11 10  4 16    12     9    27
#> 3  3  6 12 11  5 17    14    11    29
#> 4  4  7 13 12  6 18    16    13    31
#> 5  5  8 14 13  7 19    18    15    33

Created on 2018-04-13 by the reprex package (v0.2.0).



回答6:

1) dplyr/tidyr Convert to long form, summarize and convert back to wide form:

library(dplyr)
library(tidyr)

DF %>%
  mutate(Row = 1:n()) %>%
  gather(colname, value, -Row) %>%
  group_by(g = gsub("\\d", "", colname), Row) %>%
  summarize(sum = sum(value)) %>%
  ungroup %>%
  mutate(g = paste("sum", g, sep = "_")) %>%
  spread(g, sum) %>%
  arrange(Row) %>%
  cbind(DF, .) %>%
  select(-Row)

giving:

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  4  7 13 12  6 18    16    13    31
4  5  8 14 13  7 19    18    15    33

2) base using matrix multiplication

nms is a vector of column names without the digits and prefaced with sum_. u is a vector of the unique elements of it. Form a logical matrix using outer from that which when multiplied by DF gives the sums -- the logicals get converted to 0-1 when that is done. Finally bind it to the input.

nms <- gsub("(\\D+)\\d", "sum_\\1", names(DF))
u <- unique(nms)
sums <- as.matrix(DF) %*% outer(nms, setNames(u, u), "==")
cbind(DF, sums)

giving:

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  4  7 13 12  6 18    16    13    31
4  5  8 14 13  7 19    18    15    33

3) base with tapply

Using nms from (2) apply tapply to each row:

cbind(DF, t(apply(DF, 1, tapply, nms, sum)))

giving:

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  4  7 13 12  6 18    16    13    31
4  5  8 14 13  7 19    18    15    33

You may wish to replace nms with factor(nms, levels = unique(nms)) in the above expression if the names are not in ascending order.



回答7:

df %>% 
  mutate(sum_a = pmap_dbl(select(., starts_with("a")), sum), 
         sum_b = pmap_dbl(select(., starts_with("b")), sum),
         sum_c = pmap_dbl(select(., starts_with("c")), sum))

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  3  6 12 11  5 17    14    11    29
4  4  7 13 12  6 18    16    13    31
5  5  8 14 13  7 19    18    15    33


回答8:

A slightly different approach using base R:

cbind(df, lapply(unique(gsub("\\d+","", colnames(df))), function(li) {
   set_names(data.frame(V = apply(df[grep(li, colnames(df), val = T)], FUN = sum, MARGIN = 1)), paste0("sum_", li))
}))
#  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1  1  4 10  9  3 15    10     7    25
#2  2  5 11 10  4 16    12     9    27
#3  3  6 12 11  5 17    14    11    29
#4  4  7 13 12  6 18    16    13    31
#5  5  8 14 13  7 19    18    15    33