Succinct way to summarize different columns with d

Posted 2020-03-25 01:14

Question:

My question builds on a similar one by imposing an additional constraint that the name of each variable should appear only once.

Consider a data frame

library( tidyverse )
df <- tibble( potentially_long_name_i_dont_want_to_type_twice = 1:10,
              another_annoyingly_long_name = 21:30 )

I would like to apply mean to the first column and sum to the second column, without unnecessarily typing each column name twice.

As the question I linked above shows, summarize allows you to do this, but requires that the name of each column appears twice. On the other hand, summarize_at allows you to succinctly apply multiple functions to multiple columns, but it does so by calling all specified functions on all specified columns, instead of doing it in a one-to-one fashion. Is there a way to combine these distinct features of summarize and summarize_at?
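
To make the all-by-all behaviour of summarize_at concrete, here is a minimal sketch. The short column names a and b are stand-ins for the long names above, not part of the original data.

```r
library(dplyr)

# Stand-in data: `a` and `b` play the role of the two long column names.
df_short <- data.frame(a = 1:10, b = 21:30)

# Every function is applied to every column, so two columns and two
# functions yield four summaries rather than the two that are wanted.
res <- df_short %>%
  summarize_at(vars(a, b), list(mean = mean, sum = sum))

names(res)
```

Only a_mean and b_sum are of interest here; the other two columns are the unwanted cross terms.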

I was able to hack it with rlang, but I'm not sure if it's any cleaner than just typing each variable twice:

v <- c("potentially_long_name_i_dont_want_to_type_twice",
       "another_annoyingly_long_name")
f <- list(mean,sum)

## Desired output
smrz <- set_names(v) %>% map(sym) %>% map2( f, ~rlang::call2(.y,.x) )
df %>% summarize( !!!smrz )
# # A tibble: 1 x 2
#   potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                             <dbl>                        <int>
# 1                                             5.5                          255
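
To see what gets spliced into summarize, it may help to inspect the list of calls. The sketch below is a variant of the code above under two assumptions: the columns are renamed to the stand-ins a and b, and the functions are passed by name (as strings) rather than as function objects, purely so that the calls print readably.

```r
library(dplyr)
library(purrr)
library(rlang)

df_short <- tibble(a = 1:10, b = 21:30)
v <- c("a", "b")
f <- c("mean", "sum")  # function names, so the built calls are easy to read

# One call per column, named after the column it summarizes.
smrz <- map2(set_names(v), f, function(nm, fn) call2(fn, sym(nm)))

smrz$a  # the unevaluated call mean(a)
smrz$b  # the unevaluated call sum(b)

# Splicing the named list is equivalent to writing a = mean(a), b = sum(b).
res <- df_short %>% summarize(!!!smrz)
```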

EDIT to address some philosophical points

I don’t think that wanting to avoid the x=f(x) idiom is unreasonable. I probably came across a bit overzealous about typing long names, but the real issue is actually having (relatively) long names that are very similar to each other. Examples include nucleotide sequences (e.g., AGCCAGCGGAAACAGTAAGG) and TCGA barcodes. Not only is autocomplete of limited utility in such cases, but writing things like AGCCAGCGGAAACAGTAAGG = sum( AGCCAGCGGAAACAGTAAGG ) introduces unnecessary coupling and increases the risk that the two sides of the assignment might accidentally go out of sync as the code is developed and maintained.

I completely agree with @MrFlick about dplyr increasing code readability, but I don’t think that readability should come at the cost of correctness. Functions like summarize_at and mutate_at are brilliant, because they strike a perfect balance between placing operations next to their operands (clarity) and guaranteeing that the result is written to the correct column (correctness).

By the same token, I feel that the proposed solutions which remove variable mention altogether swing too far in the other direction. While inherently clever -- and I certainly appreciate the extra typing they save -- I think that, by removing the association between functions and variable names, such solutions now rely on proper ordering of variables, which creates its own risks of accidental errors.

In short, I believe that a self-mutating / self-summarizing operation should mention each variable name exactly once.

Answer 1:

I propose two tricks to solve this issue; the code and some details for both solutions are at the bottom:

A function .at that returns results for groups of variables (here only one variable per group), which we can then splice with !!!, so we get the best of both worlds, summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), mean),
  !!!.at(vars(another_annoyingly_long_name), sum))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <dbl>
#   1                                             5.5                          255

An adverb to summarize, with a dollar notation shorthand.

df %>%
  ..flx$summarize(potentially_long_name_i_dont_want_to_type_twice = ~mean(.),
                  another_annoyingly_long_name = ~sum(.))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <int>
#   1                                             5.5                          255

code for .at

It has to be used in a pipe because it relies on the . in the parent environment; messy, but it works.

.at <- function(.vars, .funs, ...) {
  in_a_piped_fun <- exists(".",parent.frame()) &&
    length(ls(envir=parent.frame(), all.names = TRUE)) == 1
  if (!in_a_piped_fun)
    stop(".at() must be called as an argument to a piped function")
  .tbl <- try(eval.parent(quote(.)))
  dplyr:::manip_at(
    .tbl, .vars, .funs, rlang::enquo(.funs), rlang::caller_env(),
    .include_group_vars = TRUE, ...)
}

I designed it to combine summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), list(foo=min, bar = max)),
  !!!.at(vars(another_annoyingly_long_name), median))

# # A tibble: 1 x 3
#       foo   bar another_annoyingly_long_name
#     <dbl> <dbl>                        <dbl>
#   1     1    10                         25.5

code for ..flx

..flx outputs a function that, before running, replaces each formula argument such as a = ~mean(.) with a call such as a = purrr::as_mapper(~mean(.))(a). This is convenient with summarize and mutate because a column cannot be a formula, so there can't be any conflict.
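
The rewrite can be checked by hand. The sketch below writes out the expanded form that a formula argument is turned into, assuming stand-in columns a and b; it does not use ..flx itself.

```r
library(dplyr)
library(purrr)

df_short <- tibble(a = 1:10, b = 21:30)

# as_mapper() turns a one-sided formula into a function of `.`
avg <- as_mapper(~ mean(.))
avg(1:10)

# Expanded form of `a = ~mean(.)` and `b = ~sum(.)`: each column name
# appears on both sides here, which is what the adverb saves you from typing.
res <- df_short %>%
  summarize(a = as_mapper(~ mean(.))(a),
            b = as_mapper(~ sum(.))(b))
```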

I like to use the dollar notation as a shorthand and to have names starting with .. so I can name those "tags" (and give them a class "tag") and see them as different objects (still experimenting with this). ..flx(summarize)(...) will work as well though.

..flx <- function(fun){
  function(...){
    mc <- match.call()
    mc[[1]] <- tail(mc[[1]],1)[[1]]
    mc[] <- imap(mc,~if(is.call(.) && identical(.[[1]],quote(`~`))) {
      rlang::expr(purrr::as_mapper(!!.)(!!sym(.y))) 
    } else .)
    eval.parent(mc)
  }
}

class(..flx) <- "tag"

`$.tag` <- function(e1, e2){
  # change original call so x$y, which is `$.tag`(e1 = x, e2 = y), becomes x(y)
  mc <- match.call()
  mc[[1]] <- mc[[2]]
  mc[[2]] <- NULL
  names(mc) <- NULL
  # evaluate it in parent env
  eval.parent(mc)
}


Answer 2:

Use .[[i]] and !!names(.)[i]:= to refer to the ith column and its name.

library(tibble)
library(dplyr)
library(rlang)

df %>% summarize(!!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]])) 

giving:

# A tibble: 1 x 2
  potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
                                            <dbl>                        <int>
1                                             5.5                          255

Update

If df were grouped (it is not in the question so this is not needed) then surround summarize with a do like this:

library(dplyr)
library(rlang)
library(tibble)

df2 <- tibble(a = 1:10, b = 11:20, g = rep(1:2, each = 5))

df2 %>%
  group_by(g) %>%
  do(summarize(., !!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]]))) %>%
  ungroup

giving:

# A tibble: 2 x 3
      g     a     b
  <int> <dbl> <int>
1     1     3    65
2     2     8    90


Answer 3:

Here's a hacky function that uses unexported functions from dplyr so it is not future proof, but you can specify a different summary for each column.

summarise_with <- function(.tbl, .funs) {
  funs <- enquo(.funs)
  syms <- syms(tbl_vars(.tbl))
  calls <- dplyr:::as_fun_list(.funs, funs, rlang::caller_env())
  stopifnot(length(syms)==length(calls))
  cols <- purrr::map2(calls, syms, ~dplyr:::expr_substitute(.x, quote(.), .y))
  cols <- purrr::set_names(cols, purrr::map_chr(syms, rlang::as_string))
  summarize(.tbl, !!!cols)
}

Then you could do

df %>% summarise_with(list(mean, sum))

and not have to type the column names at all.
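
Because the function above depends on dplyr internals, a version limited to exported functions may age better. The following is a rough sketch of the same positional idea, not the code above rewritten: summarise_positional is a hypothetical name, the columns a and b are stand-ins, and grouped data frames are not handled.

```r
library(purrr)
library(tibble)

# Hypothetical helper: apply the i-th function to the i-th column.
summarise_positional <- function(.tbl, .funs) {
  stopifnot(length(.funs) == ncol(.tbl))
  # as.list() makes the column-wise iteration explicit; names are kept.
  out <- map2(as.list(.tbl), .funs, function(col, f) f(col))
  as_tibble(out)
}

df_short <- tibble(a = 1:10, b = 21:30)
res <- summarise_positional(df_short, list(mean, sum))
```

As with the original, the pairing relies entirely on column order.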



Answer 4:

It seems like you can use map2 for this.

map2_dfc( df[v], f, ~.y(.x))

# # A tibble: 1 x 2
#   potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                             <dbl>                        <int>
# 1                                             5.5                          255
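
If relying on the order of v and f feels fragile, one possible variant is to key the functions by column name, so that each name is written once and stays attached to its function. This is a sketch with stand-in columns a and b; the list funs is an assumption, not part of the question.

```r
library(purrr)
library(tibble)

df_short <- tibble(a = 1:10, b = 21:30)

# Each column name appears exactly once, next to the function meant for it.
funs <- list(a = mean, b = sum)
stopifnot(all(names(funs) %in% names(df_short)))

# imap() passes each function together with its name; the name picks the column.
res <- as_tibble(imap(funs, function(f, nm) f(df_short[[nm]])))
```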


Tags: r dplyr