How do summarize this data table with dplyr, then

2019-09-08 08:20发布

This question was embedded in another question I asked here, but as it goes beyond the scope of what I wanted to know in the initial inquiry, I thought it might deserve a separate thread.

I've been trying to come up with a solution for this problem based on the answers I have received here and here using dplyr and the functions written by Khashaa and Jaap.

Using the solutions provided to me (especially from Jaap), I have been able to summarize the raw data I received into a matrix-looking data table

dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community", 
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L, 
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L, 
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L, 
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L, 
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L, 
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type", 
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType", 
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType", 
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType", 
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA, 
-3L))

Which looks as follows

require(dplyr)
df <- tbl_df(SO_Example_v1)
head(df)
         Type hosp1_WoundAssocType hosp1_BloodAssocType hosp1_UrineAssocType
1  Healthcare                  464                   73                   75
2   Community                  285                   40                   37
3 Contaminant                   24                   26                   18
Variables not shown: hosp1_RespAssocType (int), hosp1_CathAssocType (int), hosp2_WoundAssocType
  (int), hosp2_BloodAssocType (int), hosp2_UrineAssocType (int), hosp2_RespAssocType (int),
  hosp2_CathAssocType (int)

The column Type is the type of bacteria, the following columns represent where they were cultured. The digits represent the number of times the respective type of bacteria were detected.

I know what my final table should look like, but until now I have been doing it step by step for each comparison and variable and there must undoubtedly be a way to do this by piping multiple functions in dplyr - but alas, I have not found the answer on SO to this.

Example of what final table should look like

                                                 Wound
Type                            n Hospital 1 (%)      n Hospital 2 (%)  p-val
Healthcare associated bacteria     464 (60.0)            171 (56.4)     0.28
Community associated bacteria      285 (36.9)            115 (38.0)     0.74
Contaminants                       24 (3.1)              17 (5.6)       0.05

Where the first grouping variable "Wound" is then subsequently replaced by "Urine", "Respiratory", ... and then there's a final column termed "All/Total", which is the total number of times each variable in the rows of "Type" was found and summarized across Hospital 1 and 2 and then compared.

What I have done until now is the following and very tedious, as it's calculated "by hand" and then I poulate the table with all of the results manually.

### Wound cultures & healthcare associated (extracted manually)
# hosp1 464 (yes), 309 (no), 773 wound isolates in total; (% = 464 / 309 * 100)
# hosp2 171 (yes), 132 (no), 303 would isolates in total; (% = 171 / 303 * 100)

### Then the chisq.test of my contingency table
chisq.test(cbind(c(464,309),c(171,132)),correct=FALSE)

I appreciate that if I run a piped dplyr on the raw data.frame I won't be able to get the exact formatting of my desired table, but there must be a way to at least automate all the steps here and put the results together in a final table that I can export as a .csv file and then just do some final column editing etc.?

Any help is greatly appreciated.

1条回答
【Aperson】
2楼-- · 2019-09-08 09:00

It's ugly, but it works (Sam in the comments is right that this whole issue should probably be addressed by adjusting your data to a clean format before analysing, but anyway):

Map(
  function(x,y) {
    out <- cbind(x,y)
    final <- rbind(out[1,],colSums(out[2:3,]))
    chisq.test(final,correct=FALSE)
  },
  SO_Example_v1[grepl("^hosp1",names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2",names(SO_Example_v1))] 
)

#$hosp1_WoundAssocType
#
#        Pearson's Chi-squared test
#
#data:  final
#X-squared = 1.16, df = 1, p-value = 0.2815
# etc etc...

Matches your intended result:

chisq.test(cbind(c(464,309),c(171,132)),correct=FALSE)
#
#        Pearson's Chi-squared test
# 
#data:  cbind(c(464, 309), c(171, 132))
#X-squared = 1.16, df = 1, p-value = 0.2815
查看更多
登录 后发表回答