I have a big data frame (+239k observations on 57 variables) with some sickness descriptions and medicines administered to those sicknesses for people in different age ranges. I'd like to find those medicines in the top quartile of frequency use for each sickness description.
To make a reproducible example, I created a 1000 observations data frame:
set.seed(1);sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T));md<-as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")));age<-as.factor(sample(c("group a","group b","group c"),1000,replace=T))
df<-data.frame(obs=1:1000,md=md,sk=sk,age=age)
I can produce a table of frequencies with
xt<-xtabs(~md+sk+age,df)
I can then produce a data frame for each age group
XTDF_a<-as.data.frame(xt[,,"group a"])
and then find the 3rd quartile of frequencies of each sickness with:
Q3_a<-apply(XTDF_a,2,function(x) quantile(x,probs = .75))
to which I can compare and obtain which Medicines are over the 3rd quartile for each Sickness
XTDF_a>Q3_a
sk
md sick A sick B sick C sick D
med 1 FALSE FALSE TRUE FALSE
med 2 FALSE FALSE FALSE FALSE
med 3 TRUE TRUE FALSE FALSE
med 4 FALSE FALSE FALSE TRUE
med 5 FALSE FALSE FALSE FALSE
I can conclude that med 3
is the top selection for Sickness A, and so on (I'm acutally looping to extract that information). I then go back and repeat the process for group b, c.... which is almost impossible with the size of data I have (sicknesses are about 4200 levels and medicines are about 1150 levels).
I'm pretty sure there should be a different, easier way to achieve this. I'd appreciate a hint on a better path to follow.
I think you can speed this up by writing a bit more precise function and then using aggregate
to get the results. You could also use by
if you want a more list-based approach, which might be more useful for your next use. I think it will still be slow, but not as slow as looping.
# Here is what you gave me originally
set.seed(1)
sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T))
md<-as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")))
age<-as.factor(sample(c("group a","group b","group c"),1000,replace=T))
df<-data.frame(obs=1:1000,md=md,sk=sk,age=age)
# Define a function that basically does what you did before, but uses table()
func.get_75th_meds <- function(vector_of_meds) {
freq <- table(vector_of_meds)
return(names(freq)[freq >= quantile(x = freq,probs = 0.75)])
}
aggregate(x = list(Meds = df$md),
by = list(Sickness = df$sk,Group = df$age),
FUN = func.get_75th_meds)
Sickness Group Meds
1 sick A group a med 3, med 5
2 sick B group a med 3, med 5
3 sick C group a med 1, med 2, med 4, med 5
4 sick D group a med 2, med 4
5 sick A group b med 4, med 5
6 sick B group b med 1, med 2, med 5
7 sick C group b med 1, med 2
8 sick D group b med 2, med 3
9 sick A group c med 2, med 5
10 sick B group c med 2, med 4
11 sick C group c med 1, med 2, med 4
12 sick D group c med 1, med 3, med 4
EDITED TO ADD: Here's the alternative with by()
using the same function.
by(data = df$md,
INDICES = list(Sickness = df$sk,Group = df$age),
FUN = func.get_75th_meds)
Sickness: sick A
Group: group a
[1] "med 3" "med 5"
---------------------------------------------------------------
Sickness: sick B
Group: group a
[1] "med 3" "med 5"
---------------------------------------------------------------
... and so on
apply
can work on 3-d arrays, and you can specify multiple dimensions to iterate over:
> apply(xt,2:3,function(x) x > quantile(x, probs = .75))
, , age = group a
sk
md sick A sick B sick C sick D
med 1 FALSE FALSE TRUE FALSE
med 2 FALSE FALSE FALSE FALSE
med 3 TRUE TRUE FALSE FALSE
med 4 FALSE FALSE FALSE TRUE
med 5 FALSE FALSE FALSE FALSE
, , age = group b
sk
md sick A sick B sick C sick D
med 1 FALSE FALSE TRUE FALSE
med 2 FALSE FALSE FALSE FALSE
med 3 FALSE FALSE FALSE FALSE
med 4 TRUE FALSE FALSE FALSE
med 5 FALSE TRUE FALSE FALSE
, , age = group c
sk
md sick A sick B sick C sick D
med 1 FALSE FALSE FALSE TRUE
med 2 FALSE FALSE FALSE FALSE
med 3 FALSE FALSE FALSE FALSE
med 4 FALSE FALSE FALSE FALSE
med 5 TRUE FALSE FALSE FALSE