I'm using the dplyr package to count missing values per subgroup for each of my variables.
To count the missing values, I wrote a small helper function:
NAobs <- function(x) length(x[is.na(x)])  # count missing values in a variable
Because I have quite a few variables and wanted some extra information (sample size per group and proportion of missing data per group), I wrote the following code and plugged in one variable (task_1) to check it.
library(dplyr)
group_by(DataRT, class) %>%
  summarise(class_size = length(class), missing = NAobs(task_1), perc. = missing/class_size)
This works well, and I get a table like this:
  class class_size missing      perc.
  (dbl)      (int)   (int)      (dbl)
1     1         25       2 0.08000000
2     2         25       1 0.04000000
3     3         25       3 0.12000000
4     4         25       4 0.16000000
5     5         24       3 0.12500000
6     6         29       6 0.20689655
...
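For anyone who wants to reproduce this without my data, here is a minimal sketch. The data frame `toy` and its values are made up purely for illustration (DataRT itself is not shown here); only `NAobs` and the `summarise()` call are taken from the code above.

```r
library(dplyr)

# Helper from above: counts the NA entries in a vector
NAobs <- function(x) length(x[is.na(x)])

# Hypothetical stand-in for DataRT: two classes, one task column with NAs
toy <- data.frame(
  class  = c(1, 1, 1, 2, 2, 2, 2),
  task_1 = c(0.5, NA, 0.7, NA, NA, 0.4, 0.9)
)

# Same summary as above, with the column name written out directly
res <- toy %>%
  group_by(class) %>%
  summarise(class_size = length(class),
            missing    = NAobs(task_1),
            perc.      = missing / class_size)

print(res)
```

With these made-up values, class 1 has one NA out of three rows and class 2 has two NAs out of four, so the per-group counts are easy to check by hand.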
Next, I wanted to generalize this by wrapping it in a function:
missing <- function(x, print=TRUE){
  group_by(DataRT, class) %>%
    summarise(class_size=length(class),
              missing = NAobs(x),
              perc.= missing/class_size)
}
Ideally, I could now call missing(task_1) and get the same table. Instead, NAobs(x) ignores the grouping variable (class), and I get a table like this:
  class class_size missing    perc.
  (dbl)      (int)   (int)    (dbl)
1     1         25      59 2.360000
2     2         25      59 2.360000
3     3         25      59 2.360000
4     4         25      59 2.360000
5     5         24      59 2.458333
6     6         29      59 2.034483
...
So the "missing" column shows the total number of NA cases in task_1 in every row, ignoring the groups, which also makes perc. wrong. Replacing NAobs(x) with NAobs(task_1) inside the function would fix the output but defeat the purpose of writing a function in the first place. How can I calculate the number of missing cases per group without copying the code and changing the variable name each time? Thank you!
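One possible direction, sketched under assumptions rather than offered as a definitive fix: the repeated total suggests that x reaches summarise() as one full-length vector instead of a column that is looked up within each group. In recent dplyr versions (assuming 1.0 or later), the embrace operator `{{ }}` forwards a bare column name into the grouped context. The function name `count_missing`, its arguments, and the data frame `toy` are hypothetical and chosen for illustration only.

```r
library(dplyr)

# Helper from the question: counts the NA entries in a vector
NAobs <- function(x) length(x[is.na(x)])

# Hypothetical wrapper: data, grouping column and target column are arguments.
# {{ }} passes the bare column names on to group_by() and summarise(),
# so NAobs() is evaluated separately for each group.
count_missing <- function(data, group, var) {
  data %>%
    group_by({{ group }}) %>%
    summarise(class_size = n(),
              missing    = NAobs({{ var }}),
              perc.      = missing / class_size)
}

# Made-up data standing in for DataRT
toy <- data.frame(
  class  = c(1, 1, 1, 2, 2, 2, 2),
  task_1 = c(0.5, NA, 0.7, NA, NA, 0.4, 0.9)
)

out <- count_missing(toy, class, task_1)
print(out)
```

The wrapper is deliberately not called `missing`, since that name would mask base R's `missing()`. Passing the data frame in as an argument, rather than referring to DataRT inside the function, is a design choice of this sketch and not something the original code does.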