I have a data.table like:
library(data.table)
widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))
Although I am not sure this is the most data.table way, I can calculate subgroup frequency by group using table and length in a single step. For example, to answer the question "What percent of red widgets are round?":
Edit: this code does not provide the right answer.
# example A
widgets[, list(style = unique(style),
style_pct_of_color_by_count =
as.numeric(table(style)/length(style)) ), by=color]
# color style style_pct_of_color_by_count
# 1: red round 0.32
# 2: red pointy 0.32
# 3: red flat 0.36
# 4: green pointy 0.32
# ...
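The mismatch flagged in the edit comes from ordering: unique(style) lists styles in order of first appearance, while table(style) sorts them alphabetically, so example A pairs labels with the wrong shares. A minimal sketch of one way to keep them aligned, taking both the labels and the shares from the same table object (fixed_a, red and tab are illustrative names, not part of the question):

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

red <- widgets[color == "red"]
unique(red$style)        # order of first appearance: "round" "pointy" "flat"
names(table(red$style))  # alphabetical order:        "flat" "pointy" "round"

# Take labels and shares from the same table so they cannot drift apart.
fixed_a <- widgets[, {
  tab <- table(style)
  list(style = names(tab),
       style_pct_of_color_by_count = as.numeric(tab) / .N)
}, by = color]
```

With this data, 9 of the 25 red widgets are round, so the round share for red should come out as 0.36 rather than 0.32.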
But I can't use that approach to answer questions like "By weight, what percent of red widgets are round?" I can only come up with a two-step approach:
# example B
widgets[, list(cs_weight = sum(weight)), by = list(color, style)
        ][, list(style, style_pct_of_color_by_weight = cs_weight / sum(cs_weight)), by = color]
# color style style_pct_of_color_by_weight
# 1: red round 0.3466667
# 2: red pointy 0.3466667
# 3: red flat 0.3066667
# 4: green pointy 0.3333333
# ...
I'm looking for a single-step approach to B, and to A if it can be improved, with an explanation that deepens my understanding of data.table syntax for by-group operations. Please note that this question is different from "Weighted sum of variables by groups with data.table" because mine involves subgroups and avoiding multiple steps. TYVM.
It may be a good idea to use dplyr.
This is almost a single step:
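A minimal sketch of that near-single-step form, assuming the widgets table from the question. The names n_color, w_color, res_a and res_b are illustrative and not taken from the original answer. The outer by = color computes the denominator once per color, and the inner .SD[..., by = style] tabulates within it:

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

# A: share of each style within a color, by row count
res_a <- widgets[, {
  n_color <- .N                                   # rows in this color
  .SD[, .(frac = .N / n_color), by = style]       # rows per style / rows per color
}, by = color]

# B: share of each style within a color, by weight
res_b <- widgets[, {
  w_color <- sum(weight)                          # total weight of this color
  .SD[, .(frac = sum(weight) / w_color), by = style]
}, by = color]
```

For red and round, res_b should reproduce the 0.3466667 (26/75) shown in example B.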
How it works: Construct your denominator for the top-level group (color) before going to the finer group (color with style) to tabulate.
Alternatives. If styles repeat within each color and this is only for display purposes, try a table. For B, that means first expanding the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea, since it costs so much memory. Also, weight has to be an integer; otherwise, its sum will be silently truncated to an integer. For example, rep(1, 2.5) returns 1 1.
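A sketch of the table alternative, again assuming the widgets table from the question (tab_a, tab_b and expanded are illustrative names). prop.table(..., margin = 1) turns counts into row proportions, so each color's row sums to 1. For B, each color and style pair is first repeated once per unit of weight:

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

# A: row proportions of the color-by-style count table
tab_a <- widgets[, prop.table(table(color, style), margin = 1)]

# B: one row per unit of weight, then the same table.
# This only makes sense for integer weights and is memory-hungry on large data.
expanded <- widgets[, .(unit = rep(1L, sum(weight))), by = .(color, style)]
tab_b <- expanded[, prop.table(table(color, style), margin = 1)]
```

The result is a color-by-style matrix of shares, which suits display more than further by-group computation.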
Calculate a frequency table of style within each color; then, for each row, look up the frequency of that row's style in that table and divide by the number of rows within that color. This gives, for every row, the share of its color that has the same style.
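A minimal sketch of that lookup, assuming the widgets table from the question. The column name frac and the copy named out are illustrative, and as.numeric() is added here to drop the table attributes before the result is stored as a column:

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

out <- copy(widgets)  # work on a copy so widgets itself is not modified by :=

# Within each color: count styles, look up each row's own style, divide by group size.
out[, frac := as.numeric(table(style)[style]) / .N, by = color]
```

Unlike the grouped summaries, this keeps all 100 rows and attaches the share to each one, so every red round widget carries frac = 0.36.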
This could readily be translated to base or dplyr, if desired:
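One possible base R version, as a sketch: it swaps the table lookup for ave() group sums, which produce the same per-row shares and extend directly to the by-weight case. The data frame widgets_df and the columns frac_count and frac_weight are illustrative names; a dplyr version would follow the same group-then-mutate shape.

```r
widgets_df <- data.frame(serial_no = 1:100,
                         color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                         style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                         weight = rep_len(1:5, length.out = 100))

ones <- rep(1, nrow(widgets_df))

# By count: rows per (color, style) divided by rows per color
widgets_df$frac_count <- with(widgets_df,
  ave(ones, color, style, FUN = sum) / ave(ones, color, FUN = sum))

# By weight: weight per (color, style) divided by weight per color
widgets_df$frac_weight <- with(widgets_df,
  ave(weight, color, style, FUN = sum) / ave(weight, color, FUN = sum))
```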