In this post
select group before certain observations separated by grouping var in R with NA control, when using one group add
na.rm=T
works.
But new data, where three groups
data=structure(list(add = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"),
x1 = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), add1 = c(514L, 514L,
514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L,
514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L,
514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L,
514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L,
514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L, 514L
), group = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("female",
"male"), class = "factor"), add2 = c(2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L)), .Names = c("add", "x1", "add1",
"group", "add2"), class = "data.frame", row.names = c(NA, -52L
))
so when i run code
library(tidyverse)
library( data.table)
data %>%
group_by(add,add1,add2) %>%
mutate(group2 = rleid(group)) %>%
group_by(add,add1,add2, group, group2) %>%
mutate(MEAN = mean(x1[group=="male" & group2==1], na.rm = T), ## extra code here ##
Q25 = quantile(x1[group=="male" & group2==1], 0.25, na.rm = T)) %>% ## extra code here ##
group_by(add,add1,add2) %>%
mutate(x1 = ifelse(group=="male" & group2==3 & x1 > unique(Q25[!is.na(Q25)]), unique(MEAN[!is.na(MEAN)]), x1))%>%
ungroup() %>%
select(-group2) %>%
data.frame()
i get error
Error in mutate_impl(.data, dots) :
Column `x1` must be length 24 (the group size) or one, not 0
PS. I just provided one example to give data structure, cause there are 1000 groups. I can't find group
from which there is error
how to fix this error.
If I understand correctly, the error is caused by a first male group where all x1
are NA
in the first section (group == 1L
).
IMHO, a cleaner approach is to compute the statistics for all groups first as suggested here and to use a a non-equi join to update the affected rows in the second male group as suggested here.
library( data.table)
grp_stats <- setDT(data)[, group2 :=rleid(group), by = .(add, add1, add2)][
group2 == 1L & group == "male",
.(group2 = 3L, mean = mean(x1, na.rm = TRUE), q25 = quantile(x1, 0.25, na.rm = TRUE)),
by = .(add, add1, add2)]
grp_stats
add add1 add2 group2 mean q25
1: x 514 2018 3 1.5 1.25
2: y 515 2018 3 NaN NA
3: z 516 2018 3 2.0 2.00
The groups which produce wrong statistics can be cleary identified. It's up to the OP to remove the affected groups from the dataset.
However, for the subsequent join we can leave them in as they will not have any affect.
The column group2
with the constant value 3
already has been added to the group statistics to simplify the subsequent update in a non-equi join
:
data[, x1 := as.numeric(x1)][
grp_stats, on = .(group2, add, add1, add2, x1 > q25), x1 := mean][]
data
add x1 add1 group add2 group2
1: x 1.0 514 male 2018 1
2: x 2.0 514 male 2018 1
3: x NA 514 female 2018 2
4: x NA 514 female 2018 2
5: x 1.5 514 male 2018 3
6: x 1.0 514 male 2018 3
7: y NA 515 male 2018 1
8: y NA 515 male 2018 1
9: y NA 515 female 2018 2
10: y NA 515 female 2018 2
11: y 7.0 515 male 2018 3
12: y 1.0 515 male 2018 3
13: z 2.0 516 male 2018 1
14: z NA 516 male 2018 1
15: z NA 516 female 2018 2
16: z NA 516 female 2018 2
17: z 2.0 516 male 2018 3
18: z 1.0 516 male 2018 3
Note that rows 5 and 17 have been updated while the rows of the second group which produced the wrong statitistics haven't been touched.
x1
is coerced to type numeric
before joining to match the type of the result returned by mean()
.
Sample data
Here is a sample data consisting of three groups. In the seocnd group, all x1
values of the first male section are NA
.
data <- data.table::fread("
add x1 add1 group add2
x 1 514 male 2018
x 2 514 male 2018
x NA 514 female 2018
x NA 514 female 2018
x 7 514 male 2018
x 1 514 male 2018
y NA 515 male 2018
y NA 515 male 2018
y NA 515 female 2018
y NA 515 female 2018
y 7 515 male 2018
y 1 515 male 2018
z 2 516 male 2018
z NA 516 male 2018
z NA 516 female 2018
z NA 516 female 2018
z 7 516 male 2018
z 1 516 male 2018
")
Verify error message is caused by an all-NA first male group
When above sample dataset is piped into OP's code we can reproduce the error message:
library(dplyr)
data %>%
group_by(add,add1,add2) %>%
mutate(group2 = rleid(group)) %>%
group_by(add,add1,add2, group, group2) %>%
mutate(MEAN = mean(x1[group=="male" & group2==1], na.rm = T), ## extra code here ##
Q25 = quantile(x1[group=="male" & group2==1], 0.25, na.rm = T)) %>% ## extra code here ##
group_by(add,add1,add2) %>%
mutate(x1 = ifelse(group=="male" & group2==3 & x1 > unique(Q25[!is.na(Q25)]), unique(MEAN[!is.na(MEAN)]), x1))%>%
ungroup() %>%
select(-group2) %>%
data.frame()
Error in mutate_impl(.data, dots) :
Column x1
must be length 6 (the group size) or one, not 0