How can I, in R calculate the overall variance and the variance for each group from a dataset that looks like this (for example):
Group Count Value
A 3 5
A 2 8
B 1 11
B 3 15
I know to calculate the variance as a whole, ignoring the groups I would do:
var(rep(x$Value, x$Count)),
but how do I automatically calculate the variance for each group accounting for the frequency? E.g., the variance for group A, group B, etc.,.. I would like my output to have the following headers:
Group, Total Count, Group Variance
I have also reviewed this link; R computing mean, median, variance from file with frequency distribution which is different (does not have the group component) so this is not a duplicate.
thank you for all of the help.
One option is using data.table
. Convert the data.frame to data.table (setDT
) and get the var
of "Value" and sum
of "Count" by "Group".
library(data.table)
setDT(df1)[, list(GroupVariance=var(rep(Value, Count)),
TotalCount=sum(Count)) , by = Group]
# Group GroupVariance TotalCount
#1: A 2.7 5
#2: B 4.0 4
a similar way using dplyr
is
library(dplyr)
group_by(df1, Group) %>%
summarise(GroupVariance=var(rep(Value,Count)), TotalCount=sum(Count))
# Group GroupVariance TotalCount
#1 A 2.7 5
#2 B 4.0 4
Here's a quick wrapper with base R. First step is to grow your data set by Count, and then calculate variance by group
df1 <- df[rep(seq_len(nrow(df)), df$Count), ]
with(df1, tapply(Value, Group, var))
# A B
# 2.7 4.0
Or similarly
aggregate(Value ~ Group, df1, function(x) c(Var = var(x), Count = length(x)))
# Group Value.Var Value.Count
# 1 A 2.7 5.0
# 2 B 4.0 4.0