na.rm function doesn't work if use more then 1

2019-08-29 09:42发布

问题:

In this post select group before certain observations separated by grouping var in R with NA control, when using one group add na.rm=T works. But new data, where three groups

data=structure(list(add = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"), 
    x1 = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 
    1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), add1 = c(514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 
    514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L
    ), group = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("female", 
    "male"), class = "factor"), add2 = c(2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L)), .Names = c("add", "x1", "add1", 
"group", "add2"), class = "data.frame", row.names = c(NA, -52L
))

so when i run code

library(tidyverse)
library( data.table)
data %>%  
  group_by(add,add1,add2) %>%                                          
  mutate(group2 = rleid(group)) %>% 
  group_by(add,add1,add2, group, group2) %>%
  mutate(MEAN = mean(x1[group=="male" & group2==1], na.rm = T),      ## extra code here ##    
         Q25 = quantile(x1[group=="male" & group2==1], 0.25, na.rm = T)) %>%  ## extra code here ##
  group_by(add,add1,add2) %>%                                           
  mutate(x1 = ifelse(group=="male" & group2==3 & x1 > unique(Q25[!is.na(Q25)]), unique(MEAN[!is.na(MEAN)]), x1))%>%
  ungroup() %>%
  select(-group2) %>%
  data.frame()

i get error

Error in mutate_impl(.data, dots) : 
  Column `x1` must be length 24 (the group size) or one, not 0

PS. I just provided one example to give data structure, cause there are 1000 groups. I can't find group from which there is error

how to fix this error.

回答1:

If I understand correctly, the error is caused by a first male group where all x1 are NA in the first section (group == 1L).

IMHO, a cleaner approach is to compute the statistics for all groups first as suggested here and to use a a non-equi join to update the affected rows in the second male group as suggested here.

library( data.table)
grp_stats <- setDT(data)[, group2 :=rleid(group), by = .(add, add1, add2)][
  group2 == 1L & group == "male", 
  .(group2 = 3L, mean = mean(x1, na.rm = TRUE), q25 = quantile(x1, 0.25, na.rm = TRUE)), 
  by = .(add, add1, add2)] 
grp_stats 
   add add1 add2 group2 mean  q25
1:   x  514 2018      3  1.5 1.25
2:   y  515 2018      3  NaN   NA
3:   z  516 2018      3  2.0 2.00

The groups which produce wrong statistics can be cleary identified. It's up to the OP to remove the affected groups from the dataset.

However, for the subsequent join we can leave them in as they will not have any affect.

The column group2 with the constant value 3 already has been added to the group statistics to simplify the subsequent update in a non-equi join:

data[, x1 := as.numeric(x1)][
  grp_stats, on = .(group2, add, add1, add2, x1 > q25), x1 := mean][]
data
    add  x1 add1  group add2 group2
 1:   x 1.0  514   male 2018      1
 2:   x 2.0  514   male 2018      1
 3:   x  NA  514 female 2018      2
 4:   x  NA  514 female 2018      2
 5:   x 1.5  514   male 2018      3
 6:   x 1.0  514   male 2018      3
 7:   y  NA  515   male 2018      1
 8:   y  NA  515   male 2018      1
 9:   y  NA  515 female 2018      2
10:   y  NA  515 female 2018      2
11:   y 7.0  515   male 2018      3
12:   y 1.0  515   male 2018      3
13:   z 2.0  516   male 2018      1
14:   z  NA  516   male 2018      1
15:   z  NA  516 female 2018      2
16:   z  NA  516 female 2018      2
17:   z 2.0  516   male 2018      3
18:   z 1.0  516   male 2018      3

Note that rows 5 and 17 have been updated while the rows of the second group which produced the wrong statitistics haven't been touched.

x1 is coerced to type numeric before joining to match the type of the result returned by mean().

Sample data

Here is a sample data consisting of three groups. In the seocnd group, all x1 values of the first male section are NA.

data <- data.table::fread("
add x1 add1  group add2
x    1  514   male 2018
x    2  514   male 2018
x   NA  514 female 2018
x   NA  514 female 2018
x    7  514   male 2018
x    1  514   male 2018
y   NA  515   male 2018
y   NA  515   male 2018
y   NA  515 female 2018
y   NA  515 female 2018
y    7  515   male 2018
y    1  515   male 2018
z    2  516   male 2018
z   NA  516   male 2018
z   NA  516 female 2018
z   NA  516 female 2018
z    7  516   male 2018
z    1  516   male 2018
")

Verify error message is caused by an all-NA first male group

When above sample dataset is piped into OP's code we can reproduce the error message:

library(dplyr)
data %>% 
  group_by(add,add1,add2) %>%                                          
  mutate(group2 = rleid(group)) %>% 
  group_by(add,add1,add2, group, group2) %>%
  mutate(MEAN = mean(x1[group=="male" & group2==1], na.rm = T),      ## extra code here ##    
         Q25 = quantile(x1[group=="male" & group2==1], 0.25, na.rm = T)) %>%  ## extra code here ##
  group_by(add,add1,add2) %>%                                           
  mutate(x1 = ifelse(group=="male" & group2==3 & x1 > unique(Q25[!is.na(Q25)]), unique(MEAN[!is.na(MEAN)]), x1))%>%
  ungroup() %>%
  select(-group2) %>%
  data.frame()

Error in mutate_impl(.data, dots) :
Column x1 must be length 6 (the group size) or one, not 0