Using mutate rowwise over a subset of columns

Posted 2020-03-26 06:34

Question:

I am trying to create a new column that will contain a result of calculations done rowwise over a subset of columns of a tibble, and add this new column to the existing tibble. Like so:

df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

I effectively want to do a dplyr equivalent of this code from base R:

df$SumA <- rowSums(df[,grepl("^A", colnames(df))])

My problem is that this doesn't work:

df %>%
  select(starts_with("A")) %>%
  mutate(SumA = rowSums(.))
  # some code here

...because I got rid of the "ID" column in order to let mutate run the rowSums over the other (numerical) columns. I have tried to cbind or bind_cols in the pipe after the mutate, but it doesn't work. None of the variants of mutate work, because they work in-place (within each cell of the tibble, and not across the columns, even with rowwise).

This does work, but doesn't strike me as an elegant solution:

df %>%
  mutate(SumA = rowSums(.[,grepl("^A", colnames(df))]))

Is there any tidyverse-based solution that does not require grepl or square brackets but only more standard dplyr verbs and parameters?

My expected output is this:

df_out <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3),
  SumA = c(6, 6, 6)
)

Best kJ

Answer 1:

Here's one way to approach row-wise computation in the tidyverse using purrr::pmap. This is best used with functions that actually need to be run row by row; simple addition could probably be done a faster way. Basically we use select to provide the input list to pmap, which lets us use the select helpers such as starts_with or matches if you need regex.

library(tidyverse)
df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

df %>%
  mutate(
    SumA = pmap_dbl(
      .l = select(., starts_with("A")),
      .f = function(...) sum(...)
    )
  )
#> # A tibble: 3 x 5
#>   ID       A1    A2    A3  SumA
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 one       1     2     3     6
#> 2 two       1     2     3     6
#> 3 three     1     2     3     6

Created on 2019-01-30 by the reprex package (v0.2.1)
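
To illustrate the point about functions that really do need to run row by row, here is a minimal sketch. The helper row_range() and the sample values are made up for illustration and are not part of the original answer. pmap_dbl passes each row's selected values to the helper as separate arguments, which the helper collects with c(...).

```r
library(dplyr)
library(purrr)
library(tibble)

df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 4, 7),
  A2 = c(2, 8, 7),
  A3 = c(3, 6, 10)
)

# row_range() is a hypothetical helper: it gathers one row's values and
# returns the spread between the largest and the smallest of them.
row_range <- function(...) {
  vals <- c(...)
  max(vals) - min(vals)
}

res <- df %>%
  mutate(RangeA = pmap_dbl(select(., starts_with("A")), row_range))

res$RangeA
```

For a plain sum, a vectorised function such as rowSums avoids calling the helper once per row.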



Answer 2:

Here's a different approach that doesn't move rowwise but instead exploits the vectorised nature of addition and the fact that addition commutes. That lets us repeatedly apply + with purrr::reduce.

library(tidyverse)
df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

df %>%
  mutate(
    SumA = reduce(
      .x = select(., starts_with("A")),
      .f = `+`
    )
  )
#> # A tibble: 3 x 5
#>   ID       A1    A2    A3  SumA
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 one       1     2     3     6
#> 2 two       1     2     3     6
#> 3 three     1     2     3     6

Created on 2019-01-30 by the reprex package (v0.2.1)
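
As a small sketch of what reduce is doing here (the vectors a1, a2 and a3 are made-up stand-ins for the selected columns), folding a list of columns with `+` is the same as adding whole columns pairwise from left to right:

```r
library(purrr)

# Stand-ins for three selected columns (illustrative values only).
a1 <- c(1, 10, 100)
a2 <- c(2, 20, 200)
a3 <- c(3, 30, 300)

# reduce() folds the list left to right: first a1 + a2, then that result + a3.
via_reduce <- reduce(list(a1, a2, a3), `+`)
by_hand <- (a1 + a2) + a3

identical(via_reduce, by_hand)
```

Note that `+` has no na.rm argument, so a missing value in any column makes that row's result NA.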



Answer 3:

1) To do it with rowSums try nesting a second pipeline in the mutate like this:

library(dplyr)

df %>% mutate(Sum = select(., starts_with("A")) %>% rowSums)

giving:

# A tibble: 3 x 5
  ID       A1    A2    A3   Sum
  <chr> <dbl> <dbl> <dbl> <dbl>
1 one       1     2     3     6
2 two       1     2     3     6
3 three     1     2     3     6
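
With newer dplyr the nested select(., ...) can probably be avoided. The sketch below assumes dplyr 1.0.0 or later, where across() called with only a column selection returns the selected columns as a data frame that rowSums can consume. This is an addition, not part of the original answer; dplyr 1.1.0 also provides pick() for this kind of selection.

```r
library(dplyr)
library(tibble)

df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

# Assumes dplyr >= 1.0.0: across() with only a selection hands the
# matching columns to rowSums() as a data frame.
res <- df %>%
  mutate(Sum = rowSums(across(starts_with("A"))))

res$Sum
```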

2) An alternative is to reshape it to long form and then summarize:

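library(dplyr)
library(tidyr)

# Caveat: summarize() returns its rows sorted by ID, so the pulled vector
# lines up with df only if df's rows are already in that order.
df %>%
  mutate(Sum = gather(., key, value, -ID) %>%
               group_by(., ID) %>%
               summarize(sum = sum(value)) %>%
               ungroup %>%
               pull(sum))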

giving:

# A tibble: 3 x 5
  ID       A1    A2    A3   Sum
  <chr> <dbl> <dbl> <dbl> <dbl>
1 one       1     2     3     6
2 two       1     2     3     6
3 three     1     2     3     6
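
One thing to watch with the reshape approach: summarize() returns one row per ID, sorted by ID, so pulling the sums back positionally can misalign them when the original rows are in a different order. A hedged variant, assuming tidyr 1.0.0 or later for pivot_longer and that ID uniquely identifies each row, joins the sums back by ID instead. The sample values differ per row so the alignment is visible.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 10, 100),
  A2 = c(2, 20, 200),
  A3 = c(3, 30, 300)
)

# Reshape to long form and compute one sum per ID.
sums <- df %>%
  pivot_longer(starts_with("A"), names_to = "key", values_to = "value") %>%
  group_by(ID) %>%
  summarize(Sum = sum(value))

# Join by ID rather than by position, so row order does not matter.
res <- left_join(df, sums, by = "ID")

res$Sum
```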


Answer 4:

[upd] I didn't notice that @Calum used nearly the same approach.

Another possible way to do that:

library(dplyr)
library(purrr)

dat %>%
  mutate(SumA = pmap_dbl(select(., contains('A')), sum))   

Data:

# dat <- tibble(
#   ID = c("one", "two", "three"),
#   A1 = c(1, 1, 1),
#   A2 = c(2, 2, 2),
#   A3 = c(3, 3, 3)
# )

Output:

# # A tibble: 3 x 5
#   ID       A1    A2    A3  SumA
#   <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one       1     2     3     6
# 2 two       1     2     3     6
# 3 three     1     2     3     6
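
For comparison, a row-wise version that stays within dplyr verbs can be sketched as follows. It assumes dplyr 1.0.0 or later, where rowwise() works together with c_across(), and is not taken from the original answers.

```r
library(dplyr)
library(tibble)

dat <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

# Assumes dplyr >= 1.0.0: c_across() collects the selected columns of the
# current row into a single vector, which sum() then reduces.
res <- dat %>%
  rowwise() %>%
  mutate(SumA = sum(c_across(starts_with("A")))) %>%
  ungroup()

res$SumA
```

The ungroup() at the end drops the row-wise grouping so later verbs behave as usual. On large tables this is likely slower than rowSums, since sum() is called once per row.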


Answer 5:

You could nest and use rowSums on the nested columns:

library(tidyverse)
df %>% nest(-ID) %>%
  mutate(SumA = map_dbl(data,rowSums)) %>%
  unnest

# # A tibble: 3 x 5
#      ID  SumA    A1    A2    A3
#   <chr> <dbl> <dbl> <dbl> <dbl>
# 1   one     6     1     2     3
# 2   two     6     1     2     3
# 3 three     6     1     2     3

Or this variant on the pmap approach:

df %>% mutate(SumA = pmap_dbl(.[-1],sum))
# # A tibble: 3 x 5
#      ID    A1    A2    A3  SumA
#   <chr> <dbl> <dbl> <dbl> <dbl>
# 1   one     1     2     3     6
# 2   two     1     2     3     6
# 3 three     1     2     3     6

And to show that base is sometimes easier:

df$SumA <- rowSums(df[-1])


Tags: r dplyr