I'm using the dplyr package to count missing values per subgroup for each of my variables.
I used a mini-function:
NAobs <- function(x) length(x[is.na(x)]) ####function to count missing data for variables
to count missing values. Because I have quite a few variables and wanted to add a bit more information (sample size per group and percentage of missing data per group), I wrote the following code and plugged in one variable (task_1) to check it.
library(dplyr)
group_by(DataRT, class) %>%
summarise(class_size=length(class), missing = NAobs(task_1), perc.= missing/class_size)
This works very well and I receive a table like this:
  class class_size missing      perc.
  (dbl)      (int)   (int)      (dbl)
1     1         25       2 0.08000000
2     2         25       1 0.04000000
3     3         25       3 0.12000000
4     4         25       4 0.16000000
5     5         24       3 0.12500000
6     6         29       6 0.20689655
...
In the next step, I wanted to generalize my command by wrapping it in a function:
missing<-function(x, print=TRUE){
group_by(DataRT, class) %>%
summarise(class_size=length(class),
missing = NAobs(x),
perc.= missing/class_size)}
Ideally, I could now write missing(task_1) and get the same table. Instead, NAobs(x) ignores the grouping variable (class) and I receive a table like this:
  class class_size missing    perc.
  (dbl)      (int)   (int)    (dbl)
1     1         25      59 2.360000
2     2         25      59 2.360000
3     3         25      59 2.360000
4     4         25      59 2.360000
5     5         24      59 2.458333
6     6         29      59 2.034483
...
So the column "missing" shows only the total number of NA cases for task_1 and ignores the groups. Replacing NAobs(x) with NAobs(variable name) would fix this, but it would defeat the purpose of writing a function in the first place. How can I calculate the number of missing cases per group without copying the code and changing the variable name each time? Thank you!
New dplyr update. The newest dplyr will be able to solve this with two new functions, enquo and !!. The first quotes the input like substitute would, the second unquotes it for evaluation. For more on programming with dplyr, see the "Programming with dplyr" vignette. You will need the developer's version of dplyr, and I would also suggest the newest rlang install.
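To make the enquo / !! idea concrete, here is a minimal sketch of how the function from the question might be rewritten. It assumes a dplyr version with tidy evaluation support. The function name missing_by_class, the extra data argument, and the toy DataRT below are made up for illustration and are not from the original post; sum(is.na(...)) is used in place of NAobs and counts the same thing.

```r
library(dplyr)

# Hypothetical stand-in for DataRT: two classes, one task column with NAs.
DataRT <- data.frame(
  class  = c(1, 1, 1, 2, 2),
  task_1 = c(10, NA, 12, NA, NA)
)

# Hypothetical helper (name and signature are assumptions, not from the post).
# enquo() captures the column name the caller typed without evaluating it;
# !! unquotes it inside summarise(), so it is looked up per group.
missing_by_class <- function(data, x) {
  x <- enquo(x)
  data %>%
    group_by(class) %>%
    summarise(
      class_size = n(),               # rows in the group
      missing    = sum(is.na(!!x)),   # NA count within the group
      perc.      = missing / class_size
    )
}

result <- missing_by_class(DataRT, task_1)
print(result)
```

Because the column is unquoted inside summarise(), the NA count is computed within each class rather than over the whole column, which is what the original missing(task_1) call was meant to do.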