My question builds on a similar one by imposing an additional constraint that the name of each variable should appear only once.
Consider a data frame
library( tidyverse )
df <- tibble( potentially_long_name_i_dont_want_to_type_twice = 1:10,
another_annoyingly_long_name = 21:30 )
I would like to apply mean to the first column and sum to the second column, without unnecessarily typing each column name twice.
As the question I linked above shows, summarize allows you to do this, but requires that the name of each column appears twice. On the other hand, summarize_at allows you to succinctly apply multiple functions to multiple columns, but it does so by calling all specified functions on all specified columns, instead of doing it in a one-to-one fashion. Is there a way to combine these distinct features of summarize and summarize_at?
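To make the trade-off concrete, here is a minimal sketch of both behaviours on the data frame defined above. The result names res_pairwise and res_crossed are placeholders introduced for illustration only.

```r
library(dplyr)
library(tibble)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# summarize(): one function per column, but each name is typed twice
res_pairwise <- df %>%
  summarize(potentially_long_name_i_dont_want_to_type_twice =
              mean(potentially_long_name_i_dont_want_to_type_twice),
            another_annoyingly_long_name =
              sum(another_annoyingly_long_name))

# summarize_at(): each name is typed once, but every function is applied
# to every column, so two columns and two functions yield four results
res_crossed <- df %>%
  summarize_at(vars(potentially_long_name_i_dont_want_to_type_twice,
                    another_annoyingly_long_name),
               list(mean = mean, sum = sum))
```

The first result has the two desired columns; the second has four, one per column-function combination, which is the behaviour I want to avoid.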
I was able to hack it with rlang, but I'm not sure if it's any cleaner than just typing each variable twice:
v <- c("potentially_long_name_i_dont_want_to_type_twice",
"another_annoyingly_long_name")
f <- list(mean,sum)
## Desired output
smrz <- set_names(v) %>% map(sym) %>% map2( f, ~rlang::call2(.y,.x) )
df %>% summarize( !!!smrz )
# # A tibble: 1 x 2
# potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
# <dbl> <int>
# 1 5.5 255
EDIT to address some philosophical points
I don’t think that wanting to avoid the x=f(x) idiom is unreasonable. I probably came across a bit overzealous about typing long names, but the real issue is actually having (relatively) long names that are very similar to each other. Examples include nucleotide sequences (e.g., AGCCAGCGGAAACAGTAAGG) and TCGA barcodes. Not only is autocomplete of limited utility in such cases, but writing things like AGCCAGCGGAAACAGTAAGG = sum( AGCCAGCGGAAACAGTAAGG ) introduces unnecessary coupling and increases the risk that the two sides of the assignment might accidentally go out of sync as the code is developed and maintained.
I completely agree with @MrFlick about dplyr increasing code readability, but I don’t think that readability should come at the cost of correctness. Functions like summarize_at and mutate_at are brilliant, because they strike a perfect balance between placing operations next to their operands (clarity) and guaranteeing that the result is written to the correct column (correctness).
By the same token, I feel that the proposed solutions which remove variable mention altogether swing too far in the other direction. While inherently clever -- and I certainly appreciate the extra typing they save -- I think that, by removing the association between functions and variable names, such solutions now rely on proper ordering of variables, which creates its own risks of accidental errors.
In short, I believe that a self-mutating / self-summarizing operation should mention each variable name exactly once.
I propose two tricks to solve this issue; the code and some details for both are at the bottom:
A function .at that returns results for groups of variables (here only one variable per group), which we can then unquote-splice, so we get the best of both worlds, summarize and summarize_at:
df %>% summarize(
!!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), mean),
!!!.at(vars(another_annoyingly_long_name), sum))
# # A tibble: 1 x 2
# potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
# <dbl> <dbl>
# 1 5.5 255
An adverb to summarize, with a dollar notation shorthand:
df %>%
..flx$summarize(potentially_long_name_i_dont_want_to_type_twice = ~mean(.),
another_annoyingly_long_name = ~sum(.))
# # A tibble: 1 x 2
# potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
# <dbl> <int>
# 1 5.5 255
code for .at
It has to be used in a pipe because it relies on the . in the parent environment; messy, but it works.
.at <- function(.vars, .funs, ...) {
in_a_piped_fun <- exists(".",parent.frame()) &&
length(ls(envir=parent.frame(), all.names = TRUE)) == 1
if (!in_a_piped_fun)
stop(".at() must be called as an argument to a piped function")
.tbl <- try(eval.parent(quote(.)))
dplyr:::manip_at(
.tbl, .vars, .funs, rlang::enquo(.funs), rlang:::caller_env(),
.include_group_vars = TRUE, ...)
}
I designed it to combine summarize and summarize_at:
df %>% summarize(
!!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), list(foo=min, bar = max)),
!!!.at(vars(another_annoyingly_long_name), median))
# # A tibble: 1 x 3
# foo bar another_annoyingly_long_name
# <dbl> <dbl> <dbl>
# 1 1 10 25.5
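For readers on dplyr 1.0.0 or later, across() (which is not part of the approach above) appears to give a similar one-to-one pairing without reaching into unexported functions. This is only a sketch under that version assumption; the result name res_across is a placeholder.

```r
library(dplyr)
library(tibble)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# Assumption: dplyr >= 1.0.0, where across() is available.
# Each across() call names its column once and applies a single function,
# and the output column keeps the input column's name.
res_across <- df %>%
  summarize(across(potentially_long_name_i_dont_want_to_type_twice, mean),
            across(another_annoyingly_long_name, sum))
```

Each column name appears once, next to the function applied to it, so the pairing does not depend on column order.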
code for ..flx
..flx outputs a function that, before running, replaces formula arguments such as a = ~mean(.) with calls such as a = purrr::as_mapper(~mean(.))(a). This is convenient with summarize and mutate because a column cannot be a formula, so there can't be any conflict.
I like to use the dollar notation as a shorthand and to have names starting with .. so I can name those "tags" (and give them a class "tag") and see them as different objects (still experimenting with this). ..flx(summarize)(...) will work as well though.
..flx <- function(fun){
function(...){
mc <- match.call()
mc[[1]] <- tail(mc[[1]],1)[[1]]
mc[] <- imap(mc,~if(is.call(.) && identical(.[[1]],quote(`~`))) {
rlang::expr(purrr::as_mapper(!!.)(!!sym(.y)))
} else .)
eval.parent(mc)
}
}
class(..flx) <- "tag"
`$.tag` <- function(e1, e2){
# change original call so x$y, which is `$.tag`(e1 = x, e2 = y), becomes x(y)
mc <- match.call()
mc[[1]] <- mc[[2]]
mc[[2]] <- NULL
names(mc) <- NULL
# evaluate it in parent env
eval.parent(mc)
}
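The rewrite that ..flx performs can be checked by hand. The sketch below writes out the substituted call explicitly for a toy tibble (toy, by_hand and direct are names made up for this illustration) and compares it with the ordinary x = f(x) form.

```r
library(dplyr)
library(purrr)
library(tibble)

toy <- tibble(a = 1:10)  # hypothetical single-column example

# What a = ~mean(.) is turned into: the formula becomes a function via
# as_mapper() and is immediately called on the column of the same name
by_hand <- toy %>% summarize(a = as_mapper(~ mean(.))(a))

# The usual form, with the column name typed twice
direct <- toy %>% summarize(a = mean(a))
```

Both calls should produce the same one-row tibble, which is why the formula shorthand can stand in for the repeated name.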
Use .[[i]] and !!names(.)[i]:= to refer to the ith column and its name.
library(tibble)
library(dplyr)
library(rlang)
df %>% summarize(!!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]]))
giving:
# A tibble: 1 x 2
potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
<dbl> <int>
1 5.5 255
Update
If df were grouped (it is not in the question, so this is not needed), then surround summarize with a do like this:
library(dplyr)
library(rlang)
library(tibble)
df2 <- tibble(a = 1:10, b = 11:20, g = rep(1:2, each = 5))
df2 %>%
group_by(g) %>%
do(summarize(., !!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]]))) %>%
ungroup()
giving:
# A tibble: 2 x 3
g a b
<int> <dbl> <int>
1 1 3 65
2 2 8 90
Here's a hacky function that uses unexported functions from dplyr so it is not future proof, but you can specify a different summary for each column.
summarise_with <- function(.tbl, .funs) {
funs <- enquo(.funs)
syms <- syms(tbl_vars(.tbl))
calls <- dplyr:::as_fun_list(.funs, funs, caller_env())
stopifnot(length(syms)==length(calls))
cols <- purrr::map2(calls, syms, ~dplyr:::expr_substitute(.x, quote(.), .y))
cols <- purrr::set_names(cols, purrr::map_chr(syms, rlang::as_string))
summarize(.tbl, !!!cols)
}
Then you could do
df %>% summarise_with(list(mean, sum))
and not have to type the column names at all.
It seems like you can use map2 for this.
map2_dfc( df[v], f, ~.y(.x))
# # A tibble: 1 x 2
# potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
# <dbl> <int>
# 1 5.5 255
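The map2 call above pairs columns and functions by position, using the separate vectors v and f from the question. A possible variation, sketched here with a hypothetical named list called spec, keeps each column name next to its function so the pairing cannot drift out of order:

```r
library(purrr)
library(tibble)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# Hypothetical helper object: each column name appears once, as the
# name of the function that should summarize it
spec <- list(potentially_long_name_i_dont_want_to_type_twice = mean,
             another_annoyingly_long_name = sum)

# imap() passes the function as .x and its name as .y; the named list of
# scalars is then converted into a one-row tibble
res_named <- imap(spec, ~ .x(df[[.y]])) %>% as_tibble()
```

This is not grouped-data aware; it simply indexes df by name, so it is only a sketch for the ungrouped case in the question.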