I have a survey dataframe containing several questions (columns) coded as 1=agree/0=disagree. Respondents (rows) are categorized according to metrics "age" ("young","middle","old"), "region" ("East","Mid","West"), etc. There are around 30 categories in total (3 ages, 3 regions, 2 genders, 11 occupations, etc.). Within each metric, categories are non-overlapping and of different sizes.
This simulates a cut-down version of the dataset:
n <- 400
set.seed(1)
data <- data.frame(
  age    = sample(c('young', 'middle', 'old'), n, replace = TRUE),
  region = sample(c('East', 'Mid', 'West'), n, replace = TRUE),
  gender = sample(c('M', 'F'), n, replace = TRUE),
  Q15a   = sample(c(0, 1), n, replace = TRUE),
  Q15b   = sample(c(0, 1), n, replace = TRUE)
)
I can use Chi-square to test if the responses in, say, the West differ significantly from the total sample, for Q15a, with:
# observed counts in the West vs. proportions in the whole sample
chisq.test(table(subset(data, region == 'West')$Q15a), p = table(data$Q15a), rescale.p = TRUE)
I want to test all categories against the total sample for Q15a, and then for ~20 other questions. As there are around 30 tests per question, I want to find a way (efficient or otherwise) to automate this, but am struggling to see how to get R to do this itself or how to write a loop to cycle through the categories. I've searched[1], and got sidetracked into pairwise comparison testing with pairwise.prop.test(), but haven't found anything that really answers this yet.
[1] similar but not duplicate questions (both are column-wise tests):
Using loops to do Chi-Square Test in R
Chi Square Analysis using for loop in R
How about this?
# find all question columns containing Q, your "subset" may differ
nms <- names(data)
nms <- nms[grepl("Q", nms)]
result <- sapply(nms, FUN = function(x, data) {
  # keep only the grouping column and the current question
  qinq <- data[, c("region", x)]
  # test each region's responses against the whole-sample proportions
  by(data = qinq, INDICES = data$region, FUN = function(y, qinq) {
    chisq.test(table(y[, x]), p = table(qinq[, x]), rescale.p = TRUE)
  }, qinq = qinq)
}, data = data, simplify = FALSE)
$Q15a
data$region: East
Chi-squared test for given probabilities
data: table(y[, x])
X-squared = 0.7494, df = 1, p-value = 0.3867
---------------------------------------------------------------------------------------------
data$region: Mid
Chi-squared test for given probabilities
data: table(y[, x])
X-squared = 0.2249, df = 1, p-value = 0.6353
---------------------------------------------------------------------------------------------
data$region: West
Chi-squared test for given probabilities
data: table(y[, x])
X-squared = 1.5877, df = 1, p-value = 0.2077
$Q15b
data$region: East
Chi-squared test for given probabilities
data: table(y[, x])
X-squared = 0.0697, df = 1, p-value = 0.7918
---------------------------------------------------------------------------------------------
data$region: Mid
Chi-squared test for given probabilities
data: table(y[, x])
X-squared = 0, df = 1, p-value = 0.9987
---------------------------------------------------------------------------------------------
data$region: West
Chi-squared test for given probabilities
data: table(y[, x])
X-squared = 0.056, df = 1, p-value = 0.8129
You can extract anything you want. Here's how you would extract a p.value.
lapply(result, FUN = function(x) lapply(x, "[", "p.value"))
$Q15a
$Q15a$East
$Q15a$East$p.value
[1] 0.3866613
$Q15a$Mid
$Q15a$Mid$p.value
[1] 0.6353457
$Q15a$West
$Q15a$West$p.value
[1] 0.2076507
$Q15b
$Q15b$East
$Q15b$East$p.value
[1] 0.7918426
$Q15b$Mid
$Q15b$Mid$p.value
[1] 0.9986924
$Q15b$West
$Q15b$West$p.value
[1] 0.8128969
Happy formatting.
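The code above only splits by region. If you want every metric at once, a minimal sketch along the same lines is below. It assumes the grouping columns are exactly age, region and gender, and that every question column starts with "Q"; the names metrics, questions, grid and pvals are just illustrative, so adjust them to your real data. It collects one row per question/metric/category in a data frame instead of a nested list. Wrapping the responses in factor(..., levels = c(0, 1)) is my own addition, so that a category in which everybody gave the same answer still produces a table of length 2.

```r
# Same simulated data as in the question, repeated so this runs on its own
n <- 400
set.seed(1)
data <- data.frame(
  age    = sample(c("young", "middle", "old"), n, replace = TRUE),
  region = sample(c("East", "Mid", "West"), n, replace = TRUE),
  gender = sample(c("M", "F"), n, replace = TRUE),
  Q15a   = sample(c(0, 1), n, replace = TRUE),
  Q15b   = sample(c(0, 1), n, replace = TRUE)
)

# Assumptions: these are the grouping columns, and questions start with "Q"
metrics   <- c("age", "region", "gender")
questions <- grep("^Q", names(data), value = TRUE)

# Every question/metric combination
grid <- expand.grid(question = questions, metric = metrics,
                    stringsAsFactors = FALSE)

rows <- lapply(seq_len(nrow(grid)), function(i) {
  q <- grid$question[i]
  m <- grid$metric[i]
  # whole-sample counts, used as the reference proportions
  overall <- table(factor(data[[q]], levels = c(0, 1)))
  categories <- sort(unique(as.character(data[[m]])))
  do.call(rbind, lapply(categories, function(lev) {
    # counts within one category of the current metric
    observed <- table(factor(data[[q]][data[[m]] == lev], levels = c(0, 1)))
    test <- chisq.test(observed, p = overall, rescale.p = TRUE)
    data.frame(question = q, metric = m, category = lev,
               n = sum(observed),
               statistic = unname(test$statistic),
               p.value = test$p.value,
               stringsAsFactors = FALSE)
  }))
})

# One row per test: question, metric, category, n, statistic, p.value
pvals <- do.call(rbind, rows)
pvals
```

With the real data you would only need to extend metrics with the other grouping columns (occupation and so on). Having everything in one data frame also makes it easy to sort or filter by p.value afterwards, e.g. pvals[order(pvals$p.value), ].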
You could also use the chisq.desc() function from the EnQuireR package; it worked fine for me. The package is quite old, with very little support and no updates for a long time, so a few of its functions did not work for me, but I found chisq.desc() useful.
It colors the cells of a table containing the results of the Chi-square tests, crossing all the selected categorical variables, according to a selected threshold. I am unable to comment, so I am writing this as an answer.