Sub-function in grouping function using dplyr

Posted 2019-09-05 22:32

Question:

I'm using the dplyr package to count the missing values in each of my variables, broken down by subgroup.

I used a mini-function:

NAobs <- function(x) length(x[is.na(x)])  # count the missing values in a vector

to count missing values. Because I have quite a few variables and wanted a bit more information (sample size per group and proportion of missing data per group), I wrote the following code and plugged in one variable (task_1) to check it.

library(dplyr)
group_by(DataRT, class) %>%
  summarise(class_size=length(class), missing = NAobs(task_1), perc.= missing/class_size)

This works very well and I receive a table like this:

   class class_size missing      perc.
   (dbl)      (int)   (int)      (dbl)
1      1         25       2 0.08000000
2      2         25       1 0.04000000
3      3         25       3 0.12000000
4      4         25       4 0.16000000
5      5         24       3 0.12500000
6      6         29       6 0.20689655
...

In the next step, I wanted to generalize my command by wrapping it in a function:

missing<-function(x, print=TRUE){
            group_by(DataRT, class) %>%
                    summarise(class_size=length(class), 
                        missing = NAobs(x),
                        perc.= missing/class_size)}

Ideally, I could now write missing(task_1) and get the same table. Instead, NAobs(x) ignores the grouping variable (class) and I receive a table like this:

   class class_size missing    perc.
   (dbl)      (int)   (int)    (dbl)
1      1         25      59 2.360000
2      2         25      59 2.360000
3      3         25      59 2.360000
4      4         25      59 2.360000
5      5         24      59 2.458333
6      6         29      59 2.034483
...

So the column "missing" shows only the total number of NA cases for task_1, ignoring the groups. Replacing NAobs(x) with NAobs(variable name) would fix this, but it would defeat the purpose of writing a function in the first place. How can I calculate the number of missing cases per group without copying the code and changing the variable name each time? Thank you!

Answer 1:

Update for newer dplyr: this can be solved with two new tools, enquo and !!. enquo quotes the function's argument, much as substitute would, and !! unquotes it so that it is evaluated inside summarise, within each group. For more on programming with dplyr, see the "Programming with dplyr" vignette.
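To make the quote-then-unquote idea concrete, here is a minimal sketch on toy data. The data frame `toy` and the function name `count_na` are hypothetical and not part of the question; the sketch assumes a dplyr version that exports enquo.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(g = c("a", "a", "b"), v = c(NA, 1, NA))

count_na <- function(col) {
  col_quo <- enquo(col)  # capture the expression the caller typed (v), not a value
  toy %>%
    group_by(g) %>%
    # !! splices the captured expression back in, so it is looked up
    # in the grouped data and evaluated once per group
    summarise(n_missing = sum(is.na(!!col_quo)))
}

count_na(v)
```

Because the column name is spliced into the summarise call itself, it is resolved against each group's rows rather than against a single vector computed outside the pipeline.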

You will need the development version of dplyr, and I would also suggest installing the newest rlang.

#install developer's version until new release in May
library(dplyr) #0.5.0.9004+

#Setup
set.seed(143)
NAobs <- function(x) length(x[is.na(x)])
DataRT <- data.frame(class = sample(1:6, 25, TRUE), task1 = sample(c(NA,1), 25, TRUE),
                     task2 = sample(c(NA,1), 25, TRUE))
f <- function(x) {
  my_var <- enquo(x)
  group_by(DataRT, class) %>%
    summarise(class_size=length(class), 
    missing = NAobs(!!my_var),
    perc.= missing/class_size)
}
f(task1)
# # A tibble: 6 × 4
#   class class_size missing     perc.
#   <int>      <int>   <int>     <dbl>
# 1     1          5       0 0.0000000
# 2     2          4       2 0.5000000
# 3     3          3       0 0.0000000
# 4     4          1       0 0.0000000
# 5     5          5       3 0.6000000
# 6     6          7       3 0.4285714
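As a possible variation, later releases offer the embrace operator {{ }}, which combines the enquo and !! steps into one. The sketch below assumes dplyr 1.0.0 or later (with rlang 0.4.0 or later); the function name `missing_by_group` and its `data` and `group` arguments are hypothetical additions, chosen so the function no longer depends on a global DataRT or on a fixed grouping column.

```r
library(dplyr)

# Same helper as in the question, written with sum() for brevity
NAobs <- function(x) sum(is.na(x))

# Hypothetical generalisation: data frame, grouping column and target
# column are all passed in by the caller
missing_by_group <- function(data, group, x) {
  data %>%
    group_by({{ group }}) %>%
    summarise(class_size = n(),              # rows per group
              missing    = NAobs({{ x }}),   # NA count per group
              perc.      = missing / class_size)
}

set.seed(143)
DataRT <- data.frame(class = sample(1:6, 25, TRUE),
                     task1 = sample(c(NA, 1), 25, TRUE),
                     task2 = sample(c(NA, 1), 25, TRUE))

missing_by_group(DataRT, class, task1)
missing_by_group(DataRT, class, task2)
```

Passing the data frame as an argument also makes the function easier to test on small hand-made inputs.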