R: row-wise dplyr::mutate using function that take

2020-02-12 05:52发布

问题:

I am trying to use pipe mutate statement using a custom function. I looked a this somewhat similar SO post but in vain. Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data) :

df <- 
  data.frame(exclude=c('B','B','D'), 
             B=c(1,0,0), 
             C=c(3,4,9), 
             D=c(1,1,0), 
             blob=c('fd', 'fs', 'sa'), 
             stringsAsFactors = F)

I have a function that uses the variable names so select some based on the value in the exclude column and e.g. calculates a sum on the variables not specified in exclude (which is always a single character).

FUN <- function(df){
  sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}

When I gives a single row (row 1) to FUN I get the the expected sum of C and D (those not mentioned by exclude), namely 4:

FUN(df[1,])

How do I do similarly in a pipe with mutate (adding the result to a variable s). These two tries do not work:

df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))

UPDATE This also do not work as intended:

df %>% rowwise(.) %>% mutate(s=FUN(.))

This works of cause but is not within dplyr's mutate (and pipes):

df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))

回答1:

If you want to use dplyr you can do so using rowwise and your function FUN.

df %>% 
    rowwise %>% 
    do({
        result = as_data_frame(.)
        result$s = FUN(result)
        result
    })

The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate

df %>% 
    group_by(1:n()) %>% 
    do({
        result = as_data_frame(.)
        result$s = FUN(result)
        result
    })

The reason mutate doesn't work in this case, is that you are passing the whole tibble to it, so it's like calling FUN(df).

A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.

cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)


回答2:

purrr approach

We can use a combination of nest and map_dbl for this:

library(tidyverse)
df %>% 
  rowwise %>% 
  nest(-blob) %>% 
  mutate(s = map_dbl(data, FUN)) %>% 
  unnest

Let's break that down a little bit. First, rowwise allows us to apply each subsequent function to support arbitrary complex operations that need to be applied to each row.

Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.

Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).

unnest returns our tibble into the more standard structure.

purrrlyr approach

While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.

In your above example, we would use your data frame df and user-defined function FUN in the following way:

df %>% 
  by_row(..f = FUN, .to = "s", .collate = "cols")

That's it! Giving you:

# tibble [3 x 6]
  exclude     B     C     D  blob     s
    <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1       B     1     3     1    fd     4
2       B     0     4     1    fs     5
3       D     0     9     0    sa     9

Admittedly, the syntax is a little strange, but here's how it breaks down:

  • ..f = the function to apply to each row
  • .to = the name of the output column, in this case s
  • .collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"

See here for more information on using purrrlyr...


Performance

Forewarning, while I like the functionality of by_row, it's not always the best approach for performance! purrr is more intuitive, but also at a rather large speed loss. See the following microbenchmark test:

library(microbenchmark)
mbm <- microbenchmark(
  purrr.test = df %>% rowwise %>% nest(-blob) %>% 
    mutate(s = map_dbl(data, FUN)) %>% unnest,
  purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
  rowwise.test = df %>% 
    rowwise %>% 
    do({
      result = as_tibble(.)
      result$s = FUN(result)
      result
    }),
  group_by.test = df %>% 
    group_by(1:n()) %>% 
    do({
      result = as_tibble(.)
      result$s = FUN(result)
      result
    }),
  sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))}, 
  times = 1000
)
autoplot(mbm)

You can see that the purrrlyr approach is faster than the approach of using a combination of do with rowwise or group_by(1:n()) (see @konvas answer), and rather on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but also perhaps easier to work with. Different user-defined functions may change the speed order.



标签: r dplyr row mutate