Automate Chi-square across categories and columns

Posted 2019-04-09 13:58

Question:

I have a survey dataframe containing several questions (columns) coded as 1=agree/0=disagree. Respondents (rows) are categorized according to metrics "age" ("young","middle","old"), "region" ("East","Mid","West"), etc. There are around 30 categories in total (3 ages, 3 regions, 2 genders, 11 occupations, etc.). Within each metric, categories are non-overlapping and of different sizes.

This simulates a cut-down version of the dataset:

n <- 400
set.seed(1)
data <- data.frame(
  age    = sample(c('young', 'middle', 'old'), n, replace = TRUE),
  region = sample(c('East', 'Mid', 'West'), n, replace = TRUE),
  gender = sample(c('M', 'F'), n, replace = TRUE),
  Q15a   = sample(c(0, 1), n, replace = TRUE),
  Q15b   = sample(c(0, 1), n, replace = TRUE)
)

I can use Chi-square to test if the responses in, say, the West differ significantly from the total sample, for Q15a, with:

# Goodness-of-fit test: West responses vs. the distribution in the total sample
chisq.test(table(subset(data, region == 'West')$Q15a), p = table(data$Q15a), rescale.p = TRUE)

I want to test all categories against the total sample for Q15a, and then for ~20 other questions. As there are around 30 tests per question, I want to find a way (efficient or otherwise) to automate this, but am struggling to see how to get R to do this itself or how to write a loop to cycle through the categories. I've searched[1], and got sidetracked into pairwise comparison testing with pairwise.prop.test(), but haven't found anything that really answers this yet.

[1] similar but not duplicate questions (both are column-wise tests):

Using loops to do Chi-Square Test in R

Chi Square Analysis using for loop in R

Answer 1:

How about this?

# find all question columns containing Q, your "subset" may differ
nms <- names(data)
nms <- nms[grepl("Q", nms)]

result <- sapply(nms, FUN = function(x, data) {
  qinq <- data[, c("region", x)]
  by(data = qinq, INDICES = data$region, FUN = function(y, qinq) {
    chisq.test(table(y[, x]), p =  table(qinq[, x]), rescale.p = TRUE)
  }, qinq = qinq)
}, data = data, simplify = FALSE)

$Q15a
data$region: East

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0.7494, df = 1, p-value = 0.3867

--------------------------------------------------------------------------------------------- 
data$region: Mid

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0.2249, df = 1, p-value = 0.6353

--------------------------------------------------------------------------------------------- 
data$region: West

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 1.5877, df = 1, p-value = 0.2077


$Q15b
data$region: East

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0.0697, df = 1, p-value = 0.7918

--------------------------------------------------------------------------------------------- 
data$region: Mid

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0, df = 1, p-value = 0.9987

--------------------------------------------------------------------------------------------- 
data$region: West

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0.056, df = 1, p-value = 0.8129

You can extract anything you want. Here's how you would extract a p.value.

lapply(result, FUN = function(x) lapply(x, "[", "p.value"))

$Q15a
$Q15a$East
$Q15a$East$p.value
[1] 0.3866613


$Q15a$Mid
$Q15a$Mid$p.value
[1] 0.6353457


$Q15a$West
$Q15a$West$p.value
[1] 0.2076507



$Q15b
$Q15b$East
$Q15b$East$p.value
[1] 0.7918426


$Q15b$Mid
$Q15b$Mid$p.value
[1] 0.9986924


$Q15b$West
$Q15b$West$p.value
[1] 0.8128969
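
A nested list gets hard to scan once there are ~20 questions. As a rough sketch (my addition, not part of the answer's code), the p-values can be collapsed into a category-by-question matrix. The name pvals is made up for illustration, and data and result are rebuilt here only so the snippet runs on its own. The numbers you get depend on your R version's sample() behaviour, so none are shown.

```r
# Re-create the simulated data and the per-region tests from above
n <- 400
set.seed(1)
data <- data.frame(
  age    = sample(c("young", "middle", "old"), n, replace = TRUE),
  region = sample(c("East", "Mid", "West"), n, replace = TRUE),
  gender = sample(c("M", "F"), n, replace = TRUE),
  Q15a   = sample(c(0, 1), n, replace = TRUE),
  Q15b   = sample(c(0, 1), n, replace = TRUE)
)
nms <- grep("Q", names(data), value = TRUE)
result <- sapply(nms, FUN = function(x, data) {
  qinq <- data[, c("region", x)]
  by(data = qinq, INDICES = data$region, FUN = function(y, qinq) {
    chisq.test(table(y[, x]), p = table(qinq[, x]), rescale.p = TRUE)
  }, qinq = qinq)
}, data = data, simplify = FALSE)

# Collapse the nested list into a matrix: one row per region, one column per question
pvals <- sapply(result, function(tests) {
  vapply(tests, function(h) h$p.value, numeric(1))
})
print(round(pvals, 4))
```

From there, something like which(pvals < 0.05, arr.ind = TRUE) would list the region/question pairs below a chosen threshold.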

Happy formatting.
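
The code above only loops over region. One possible way to cover every metric (age, region, gender, ...) in a single pass is sketched below. It is an illustration under some assumptions, not a definitive implementation: the metric columns are listed by hand, question columns are assumed to start with "Q", answers are assumed to be coded 0/1, and test_one and grid are hypothetical names.

```r
# Simulated data as in the question
n <- 400
set.seed(1)
data <- data.frame(
  age    = sample(c("young", "middle", "old"), n, replace = TRUE),
  region = sample(c("East", "Mid", "West"), n, replace = TRUE),
  gender = sample(c("M", "F"), n, replace = TRUE),
  Q15a   = sample(c(0, 1), n, replace = TRUE),
  Q15b   = sample(c(0, 1), n, replace = TRUE)
)

metrics   <- c("age", "region", "gender")            # assumption: listed by hand
questions <- grep("^Q", names(data), value = TRUE)   # assumption: names start with "Q"

# Test one category (rows where metric == level) against the total sample
test_one <- function(data, metric, level, question) {
  # Fixed levels keep both cells in the table even if a category has no 0s or no 1s
  answers  <- factor(data[[question]], levels = c(0, 1))
  observed <- table(answers[data[[metric]] == level])
  expected <- table(answers)
  chisq.test(observed, p = expected, rescale.p = TRUE)$p.value
}

# One row per (question, metric, category) combination
grid <- do.call(rbind, lapply(metrics, function(m) {
  expand.grid(question = questions, metric = m,
              category = sort(unique(as.character(data[[m]]))),
              stringsAsFactors = FALSE)
}))

# Run the test for every row and store the p-value next to it
grid$p.value <- mapply(test_one, metric = grid$metric, level = grid$category,
                       question = grid$question,
                       MoreArgs = list(data = data), USE.NAMES = FALSE)
print(grid)
```

The result is a plain data frame, so it can be sorted or filtered by p.value like any other table.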



Answer 2:

You could also use the chisq.desc() function from the EnQuireR package, which worked fine for me. The package is quite old, has not been updated for a long time, and has very little support available, so a few of its functions did not work for me, but I found chisq.desc() useful. It colors the cells of a table containing the results of the Chi-square test, crossing all the selected categorical variables, according to a selected threshold. I am unable to comment, so I am writing this as an answer.



Tags: r chi-squared