I'm trying to run a chi-square test on several groups within the dplyr framework. The problem is that group_by() %>% summarise() doesn't seem to do the trick.
Simulated data (same structure as the problematic data, but random, so the p-values should be high):
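set.seed(1)
# Note: the third positional argument of sample() is `replace`, so 0.6, 0.7 and 0.8
# are treated as TRUE (sampling with replacement), not as probabilities.
data.frame(partido=sample(c("PRI", "PAN"), 100, 0.6),
           genero=sample(c("H", "M"), 100, 0.7),
           GM=sample(c("Bajo", "Muy bajo"), 100, 0.8)) -> foo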
I want to compare the groups defined by GM to see whether the p-value for the crosstab of partido and genero changes, conditional on GM.
The obvious dplyr way should be:
foo %>%
  group_by(GM) %>%
  summarise(pvalue=chisq.test(.$partido, .$genero)$p.value) # just the p.value, so summarise is happy
But I get the p-value for the ungrouped data, repeated twice, rather than the p-value for each table:
# A tibble: 2 × 2
GM pvalue
<fctr> <dbl>
1 Bajo 0.8660521
2 Muy bajo 0.8660521
Testing each group using filter I get:
library(magrittr) # provides the %$% exposition pipe
foo %>%
  filter(GM=="Bajo") %$%
  table(partido, genero) %>%
  chisq.test()
Returns: X-squared = 0.015655, df = 1, p-value = 0.9004
foo %>%
  filter(GM=="Muy bajo") %$%
  table(partido, genero) %>%
  chisq.test()
Returns: X-squared = 0.50409, df = 1, p-value = 0.4777
dplyr::summarise() works with functions that take more than one argument, so this shouldn't be the problem:
data.frame(a=1:10, b=10:1, c=sample(c("Grupo 1", "Grupo 2"), 10, 0.5)) %>%
  group_by(c) %>%
  summarise(r=cor(a, b))
works like a charm. It just doesn't seem to work with chisq.test.
I managed to get what I wanted with nested models using tidyr::nest() and purrr::map(), but I find the code cumbersome, at least for my students. In fact, I've invested many hours teaching them dplyr (they are a group that struggles with math and programming) so they could avoid vector functions as much as possible.
foo %>%
  nest(-GM) %>%
  mutate(tabla=map(data, ~table(.))) %>%
  mutate(pvalue=map(tabla, ~chisq.test(.)$p.value)) %>%
  select(GM, pvalue) %>%
  unnest()
# A tibble: 2 × 2
GM pvalue
<fctr> <dbl>
1 Bajo 0.9004276
2 Muy bajo 0.4777095
do() does the trick too:
library(broom) # provides tidy()
foo %>%
  group_by(GM) %>%
  do(tidy(chisq.test(.$partido, .$genero)))
Source: local data frame [2 x 5]
Groups: GM [2]
GM statistic p.value parameter
<fctr> <dbl> <dbl> <int>
1 Bajo 0.0156553 0.9004276 1
2 Muy bajo 0.5040878 0.4777095 1
# ... with 1 more variables: method <fctr>
as in: Fisher's and Pearson's test for independence
But why doesn't group_by() work with summarise(chisq.test()$p.value)?
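A quick check hints at where the problem lies. This is only a minimal sketch: it assumes the magrittr pipe re-exported by dplyr, rebuilds the simulated data with replace = TRUE written out, and uses the made-up column names n_dot and n_grp purely for illustration. It compares the number of rows in whatever the dot refers to inside summarise() with the number of rows in the current group.

```r
library(dplyr)

# Simulated data with the same structure as in the question
set.seed(1)
foo <- data.frame(partido = sample(c("PRI", "PAN"), 100, replace = TRUE),
                  genero  = sample(c("H", "M"), 100, replace = TRUE),
                  GM      = sample(c("Bajo", "Muy bajo"), 100, replace = TRUE))

# n_dot and n_grp are hypothetical names used only for this check
check <- foo %>%
  group_by(GM) %>%
  summarise(n_dot = nrow(.),  # rows in the object piped into summarise()
            n_grp = n())      # rows in the current group

check
```

Here n_dot should come out as 100 for every group, while n_grp gives the actual group sizes. That suggests the dot stands for the whole data frame piped into summarise(), not for the current group, which would explain why both rows show the same p-value.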
In dplyr you can generally just use unquoted variable names to access the relevant columns, whether the data is grouped or not. So, removing the unneeded .$ accessors from .$partido and .$genero, I get: