R: How to pivot and count data.frame (ex: list of

2019-08-31 11:53发布

问题:

I'm trying to get better with dplyr and tidyr but I'm not used to "thinking in R". An example may be best. The table I've generated from my data in sql looks like this:

╔═══════════╦════════════╦═════╦════════╦══════════════╦══════════╦══════════════╗
║ patientid ║ had_stroke ║ age ║ gender ║ hypertension ║ diabetes ║ estrogen HRT ║
╠═══════════╬════════════╬═════╬════════╬══════════════╬══════════╬══════════════╣
║ 934988    ║          1 ║  65 ║ M      ║            1 ║        1 ║            0 ║
║ 94044     ║          0 ║  69 ║ F      ║            1 ║        0 ║            0 ║
║ 689348    ║          0 ║  56 ║ F      ║            0 ║        1 ║            1 ║
║ 902498    ║          1 ║  45 ║ M      ║            0 ║        0 ║            1 ║
║ …         ║            ║     ║        ║              ║          ║              ║
╚═══════════╩════════════╩═════╩════════╩══════════════╩══════════╩══════════════╝

I would like to create an output table that conveys the following information:

╔══════════════╦════════╦══════════╦══════════╦══════════╦═══════════╗
║              ║ total  ║M lt50 yo ║F lt50 yo ║M gte50yo ║F gte 50yo ║
╠══════════════╬════════╬══════════╬══════════╬══════════╬═══════════╣
║ estrogen HRT ║    347 ║        2 ║       65 ║        4 ║        97 ║
║ diabetes     ║  13922 ║       54 ║       73 ║      192 ║       247 ║
║ hypertension ║   8210 ║      102 ║      187 ║      443 ║       574 ║
╚══════════════╩════════╩══════════╩══════════╩══════════╩═══════════╝

Total is the total number of patients with that comorbidity (easy enough: sum(data$estrogen == 1) etc). The other cells are now the number of patients with that comorbidity in that age and gender stratification where had_stroke==1.

I'd love to just get a general idea of how to approach problems like this as it seems like a pretty fundamental way to transform data. If the total column makes it funky then feel free to exclude that.

回答1:

Try to do simpler.

I assume that you have a data.frame called data. These is a toy data set.

set.seed(0)
data <- data.frame(estrogen = runif(100) < .10,
               diabetes = runif(100) < .15,
               hypertension = runif(100) < .20,
               groups = cut(runif(100), c(0,.1,.4,.7,1), labels = c("my", "fy", "mo", "fo")))

Add new var to data frame for groups.

Then, use table() to get summaries

res <- rbind(
  table(data$estrogen, data$groups)[2,],
  table(data$diabetes, data$groups)[2,],
  table(data$hypertension, data$groups)[2,]
)
res <- cbind(apply(res, 1, sum), res)

Finaly, use colnames(res) y rownames(res) to set appropriate names to columns and rows.

colnames(res)[1] <- "Total"
rownames(res) <- c("estrogen", "diabetes", "hypertension")

Results

             Total my fy mo fo
estrogen        12  2  2  4  4
diabetes        28  1  8 11  8
hypertension    27  1 10 11  5


回答2:

So here is a data.table solution.

# create MRE - you have this already
n  <- 1000
set.seed(1)     # for reproducible example
df <- data.frame(ID=sample(1:n,n),had_stroke=sample(0:1,n,replace=TRUE),
                age=sample(25:85,n,replace=TRUE), gender=sample(c("M","F"),n,replace=TRUE),
                hypertension=sample(0:1,n,replace=TRUE),
                diabetes=sample(0:1,n,replace=TRUE),
                estrogen=sample(0:1,n,replace=TRUE))

# you start here.
library(data.table)
result <- melt(setDT(df),measure=5:7, variable.name="comorbidity")
result[,list(total=sum(value==1), 
             M.lt.50=sum(value[gender=="M"&age< 50]),
             F.lt.50=sum(value[gender=="F"&age< 50]),
             M.ge.50=sum(value[gender=="M"&age>=50]),
             F.ge.50=sum(value[gender=="F"&age>=50])),
       by=comorbidity]

#     comorbidity total M.lt.50 F.lt.50 M.ge.50 F.ge.50
# 1: hypertension   521     104     126     143     148
# 2:     diabetes   482     109     120     125     128
# 3:     estrogen   492      99     126     119     148

I know you asked for dlpyr/tidy (and now that I've provided an MRE dataset, I'm sure you'll get one...). IMO data.table is a better option: the syntax is no worse and it's almost always faster, usually by factors of 10-100.