I'm trying to get better with dplyr and tidyr but I'm not used to "thinking in R". An example may be best. The table I've generated from my data in sql looks like this:
╔═══════════╦════════════╦═════╦════════╦══════════════╦══════════╦══════════════╗
║ patientid ║ had_stroke ║ age ║ gender ║ hypertension ║ diabetes ║ estrogen HRT ║
╠═══════════╬════════════╬═════╬════════╬══════════════╬══════════╬══════════════╣
║ 934988 ║ 1 ║ 65 ║ M ║ 1 ║ 1 ║ 0 ║
║ 94044 ║ 0 ║ 69 ║ F ║ 1 ║ 0 ║ 0 ║
║ 689348 ║ 0 ║ 56 ║ F ║ 0 ║ 1 ║ 1 ║
║ 902498 ║ 1 ║ 45 ║ M ║ 0 ║ 0 ║ 1 ║
║ … ║ ║ ║ ║ ║ ║ ║
╚═══════════╩════════════╩═════╩════════╩══════════════╩══════════╩══════════════╝
I would like to create an output table that conveys the following information:
╔══════════════╦════════╦══════════╦══════════╦══════════╦═══════════╗
║ ║ total ║M lt50 yo ║F lt50 yo ║M gte50yo ║F gte 50yo ║
╠══════════════╬════════╬══════════╬══════════╬══════════╬═══════════╣
║ estrogen HRT ║ 347 ║ 2 ║ 65 ║ 4 ║ 97 ║
║ diabetes ║ 13922 ║ 54 ║ 73 ║ 192 ║ 247 ║
║ hypertension ║ 8210 ║ 102 ║ 187 ║ 443 ║ 574 ║
╚══════════════╩════════╩══════════╩══════════╩══════════╩═══════════╝
Total is the total number of patients with that comorbidity (easy enough: sum(data$estrogen == 1) etc). The other cells are now the number of patients with that comorbidity in that age and gender stratification where had_stroke==1.
I'd love to just get a general idea of how to approach problems like this as it seems like a pretty fundamental way to transform data. If the total column makes it funky then feel free to exclude that.
Try to do simpler.
I assume that you have a data.frame
called data
. These is a toy data set.
set.seed(0)
data <- data.frame(estrogen = runif(100) < .10,
diabetes = runif(100) < .15,
hypertension = runif(100) < .20,
groups = cut(runif(100), c(0,.1,.4,.7,1), labels = c("my", "fy", "mo", "fo")))
Add new var to data frame for groups.
Then, use table()
to get summaries
res <- rbind(
table(data$estrogen, data$groups)[2,],
table(data$diabetes, data$groups)[2,],
table(data$hypertension, data$groups)[2,]
)
res <- cbind(apply(res, 1, sum), res)
Finaly, use colnames(res)
y rownames(res)
to set appropriate names to columns and rows.
colnames(res)[1] <- "Total"
rownames(res) <- c("estrogen", "diabetes", "hypertension")
Results
Total my fy mo fo
estrogen 12 2 2 4 4
diabetes 28 1 8 11 8
hypertension 27 1 10 11 5
So here is a data.table solution.
# create MRE - you have this already
n <- 1000
set.seed(1) # for reproducible example
df <- data.frame(ID=sample(1:n,n),had_stroke=sample(0:1,n,replace=TRUE),
age=sample(25:85,n,replace=TRUE), gender=sample(c("M","F"),n,replace=TRUE),
hypertension=sample(0:1,n,replace=TRUE),
diabetes=sample(0:1,n,replace=TRUE),
estrogen=sample(0:1,n,replace=TRUE))
# you start here.
library(data.table)
result <- melt(setDT(df),measure=5:7, variable.name="comorbidity")
result[,list(total=sum(value==1),
M.lt.50=sum(value[gender=="M"&age< 50]),
F.lt.50=sum(value[gender=="F"&age< 50]),
M.ge.50=sum(value[gender=="M"&age>=50]),
F.ge.50=sum(value[gender=="F"&age>=50])),
by=comorbidity]
# comorbidity total M.lt.50 F.lt.50 M.ge.50 F.ge.50
# 1: hypertension 521 104 126 143 148
# 2: diabetes 482 109 120 125 128
# 3: estrogen 492 99 126 119 148
I know you asked for dlpyr/tidy (and now that I've provided an MRE dataset, I'm sure you'll get one...). IMO data.table is a better option: the syntax is no worse and it's almost always faster, usually by factors of 10-100.