Get `chisq.test()$p.value` for several groups usin

I'm trying to run a chi-square test on several groups within the dplyr framework. The problem is that group_by() %>% summarise() doesn't seem to do the trick.

Simulated data (same structure as the problematic data, but random, so the p-values should be high):

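library(dplyr)     # %>%, group_by(), summarise()
library(magrittr)  # %$% used in the filter() checks below
set.seed(1)
# replace = TRUE: sample with replacement (the third positional argument of sample())
foo <- data.frame(partido = sample(c("PRI", "PAN"), 100, replace = TRUE),
                  genero  = sample(c("H", "M"), 100, replace = TRUE),
                  GM      = sample(c("Bajo", "Muy bajo"), 100, replace = TRUE))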

I want to compare the groups defined by GM to see whether the p-value for the crosstab of partido and genero changes, conditional on GM.

The obvious dplyr way should be:

foo %>% 
  group_by(GM) %>% 
  summarise(pvalue=chisq.test(.$partido, .$genero)$p.value)  #just the p.value, so summarise is happy

But I get the p-value for the ungrouped data, repeated twice, instead of the p-value for each group's table:

# A tibble: 2 × 2
        GM    pvalue
    <fctr>     <dbl>
1     Bajo 0.8660521
2 Muy bajo 0.8660521

Testing each group using filter I get:

foo %>% 
  filter(GM=="Bajo") %$% 
  table(partido, genero) %>% 
  chisq.test()

Returns: X-squared = 0.015655, df = 1, p-value = 0.9004

foo %>% 
  filter(GM=="Muy bajo") %$% 
  table(partido, genero) %>% chisq.test()

Returns: X-squared = 0.50409, df = 1, p-value = 0.4777

dplyr::summarise() works with functions that take more than one argument, so this shouldn't be the problem:

data.frame(a=1:10, b=10:1, c=sample(c("Grupo 1", "Grupo 2"), 10, replace=TRUE)) %>% 
    group_by(c) %>% 
    summarise(r=cor(a, b))

works like a charm. It just doesn't seem to work with chisq.test().

I managed to get what I wanted with nested models using tidyr::nest() and purrr::map(), but I find the code cumbersome, at least for my students. I've invested many hours teaching them dplyr (they find math and programming very challenging) so that they could avoid vector functions as much as possible.

foo %>% 
  nest(-GM) %>% 
  mutate(tabla=map(data, ~table(.))) %>% 
  mutate(pvalue=map(tabla, ~chisq.test(.)$p.value)) %>% 
  select(GM, pvalue) %>% 
  unnest()

# A tibble: 2 × 2
       GM   pvalue
    <fctr>  <dbl>
1     Bajo  0.9004276
2 Muy bajo  0.4777095

do() does the trick too:

foo %>% 
  group_by(GM) %>% 
  do(tidy(chisq.test(.$partido, .$genero)))

Source: local data frame [2 x 5]
Groups: GM [2]
    GM statistic   p.value parameter
<fctr>     <dbl>     <dbl>     <int>
1     Bajo 0.0156553 0.9004276         1
2 Muy bajo 0.5040878 0.4777095         1
# ... with 1 more variables: method <fctr>

as in: Fisher's and Pearson's test for independence

But why doesn't group_by() work with summarise(chisq.test()$p.value)?

Tags: r dplyr chi-squared tidyverse

1 answer

在下西门庆

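Post #2 · 2019-05-23 06:54

In dplyr you can generally use unquoted variable names to refer to columns, whether or not the data is grouped. Inside summarise(), the dot (.) is the whole data frame passed in by %>%, not the current group, so .$partido and .$genero return the full, ungrouped columns; that is why both rows showed the same p-value. Removing the unneeded .$ accessors gives one test per group: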

foo %>% 
    group_by(GM) %>% 
    summarise(pvalue= chisq.test(partido, genero)$p.value) 

# A tibble: 2 × 2
        GM    pvalue
    <fctr>     <dbl>
1     Bajo 0.9004276
2 Muy bajo 0.4777095
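
To see the difference between the two spellings directly, here is a minimal sketch that counts the rows each one sees. It assumes the same simulated foo as in the question; the names check, n_group and n_dot are made up for this illustration.

```r
library(dplyr)

set.seed(1)
foo <- data.frame(
  partido = sample(c("PRI", "PAN"), 100, replace = TRUE),
  genero  = sample(c("H", "M"), 100, replace = TRUE),
  GM      = sample(c("Bajo", "Muy bajo"), 100, replace = TRUE)
)

# `check`, `n_group` and `n_dot` are hypothetical names used only here
check <- foo %>%
  group_by(GM) %>%
  summarise(
    n_group = n(),               # rows in the current group
    n_dot   = length(.$partido)  # rows reached through `.`: the whole data frame
  )

print(check)
```

n_dot should be 100 in both rows, while n_group splits the 100 rows between the two levels of GM. A test built on .$partido and .$genero therefore uses all 100 rows every time, regardless of the grouping.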
