I am trying to use pipe mutate statement using a custom function. I looked a this somewhat similar SO post but in vain.
Say I have a data frame like this (where blob
is some variable not related to the specific task but is part of the entire data) :
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names so select some based on the value in the exclude
column and e.g. calculates a sum on the variables not specified in exclude
(which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I gives a single row (row 1) to FUN
I get the the expected sum of C
and D
(those not mentioned by exclude
), namely 4:
FUN(df[1,])
How do I do similarly in a pipe with mutate (adding the result to a variable s
). These two tries do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also do not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works of cause but is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr
you can do so using rowwise
and your function FUN
.
df %>%
rowwise %>%
do({
result = as_data_frame(.)
result$s = FUN(result)
result
})
The same can be achieved using group_by
instead of rowwise
(like you already tried) but with do
instead of mutate
df %>%
group_by(1:n()) %>%
do({
result = as_data_frame(.)
result$s = FUN(result)
result
})
The reason mutate
doesn't work in this case, is that you are passing the whole tibble to it, so it's like calling FUN(df)
.
A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums
.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
purrr
approach
We can use a combination of nest
and map_dbl
for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise
allows us to apply each subsequent function to support arbitrary complex operations that need to be applied to each row.
Next, nest
will create a new column that is a list of our data to be fed into FUN
(the beauty of tibbles vs data.frames!). Since we are applying this rowwise
, each row contains a single-row tibble of exclude:D
.
Finally, we use map_dbl
to map our FUN
to each of these tibbles. map_dbl
is used over the family of other map_*
functions since our intended output is numeric (i.e. double).
unnest
returns our tibble into the more standard structure.
purrrlyr
approach
While purrrlyr
may not be as 'popular' as its parents dplyr
and purrr
, its by_row
function has some utility here.
In your above example, we would use your data frame df
and user-defined function FUN
in the following way:
df %>%
by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
exclude B C D blob s
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 B 1 3 1 fd 4
2 B 0 4 1 fs 5
3 D 0 9 0 sa 9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f
= the function to apply to each row
.to
= the name of the output column, in this case s
.collate
= the way the results should be collated, by list, row, or column. Since FUN
only has a single output, we would be fine to use either "cols"
or "rows"
See here for more information on using purrrlyr
...
Performance
Forewarning, while I like the functionality of by_row
, it's not always the best approach for performance! purrr
is more intuitive, but also at a rather large speed loss. See the following microbenchmark
test:
library(microbenchmark)
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr
approach is faster than the approach of using a combination of do
with rowwise
or group_by(1:n())
(see @konvas answer), and rather on par with the sapply
approach. However, the package is admittedly not the most intuitive. The standard purrr
approach seems to be the slowest, but also perhaps easier to work with. Different user-defined functions may change the speed order.