So I had this question but the scope got a little larger/more complicated.
Basically I want to combine two tables and calculate the weighted average for any duplicate IDs. The problem is I will have multiple sets of columns that will need to use different weights. Here's my two datasets (RMS1 and RMS2) and the desired outcome (Joined):
RMS1:
id,freq1,sev1,count1,freq2,sev2,count2
111 0 2 50 1 2 25
222 1 3 75 2 4 50
RMS2:
id,freq1,sev1,count1,freq2,sev2,count2
222 2 4 25 6 6 200
333 4 5 60 3 2 20
Joined:
id freq1 sev1 freq2 sev2
111 0 2 1 2
222 1.25* 3.25* 5** 5.5**
333 4 5 3 2
So the * values are weighted averages based on count1, but the ** values are weighted averages based on count2 (at least they should be, I tried to do the math quickly). My entire dataset has 13 groups that use separate counts for weights. I have this code but obviously it needs to be expanded to take in multiple weights:
Joined <- bind_rows(RMS1, RMS2) %>%
group_by(id) %>%
summarise_at(vars(-count1), funs(weighted.mean(., count1))) %>%
as.data.frame()
Being new to R I have no idea where to even start. I found a possibly related question, but it's going over my head. Thanks in advance.
A solution using tidyr and dplyr. The idea is to convert the data frame, conduct the calculation, and transform back to the original format.
library(dplyr)
library(tidyr)
Joined <- bind_rows(RMS1, RMS2) %>%
mutate(rowid = 1:n()) %>%
gather(Column, Value, -id, -rowid) %>%
extract(Column, into = c("Type", "Number"),
regex = "([A-Za-z]*)([0-9]*)", convert = TRUE) %>%
spread(Type, Value) %>%
group_by(id, Number) %>%
summarise_at(vars(-rowid, -count), funs(weighted.mean(., count))) %>%
gather(Type, Value, -id, -Number) %>%
unite(Column, Type, Number, sep = "") %>%
spread(Column, Value) %>%
ungroup() %>%
as.data.frame()
Joined
# id freq1 freq2 sev1 sev2
# 1 111 0.00 1.0 2.00 2.0
# 2 222 1.25 5.2 3.25 5.6
# 3 333 4.00 3.0 5.00 2.0
DATA
RMS1 <- read.table(text = "id freq1 sev1 count1 freq2 sev2 count2
111 0 2 50 1 2 25
222 1 3 75 2 4 50
", header = TRUE)
RMS2 <- read.table(text = "id freq1 sev1 count1 freq2 sev2 count2
222 2 4 25 6 6 200
333 4 5 60 3 2 20
", header = TRUE)
1) dplyr rbind
the two input data frames together and then grouping by id
perform a
weighted mean on each required column:
library(dplyr)
RMS1 %>%
rbind(RMS2) %>%
group_by(id) %>%
summarize(freq1 = weighted.mean(freq1, count1),
sev1 = weighted.mean(sev1, count1),
freq2 = weighted.mean(freq2, count2),
sev2 = weighted.mean(sev2, count2)) %>%
ungroup
giving:
# A tibble: 3 x 5
id freq1 sev1 freq2 sev2
<int> <dbl> <dbl> <dbl> <dbl>
1 111 0 2.00 1.00 2.00
2 222 1.25 3.25 5.20 5.60
3 333 4.00 5.00 3.00 2.00
2) sqldf An alternative using sql is:
library(sqldf)
sqldf("select id,
sum(count1 * freq1 + 0.0) / sum(count1) freq1,
sum(count1 * sev1 + 0.0) / sum(count1) sev1,
sum(count2 * freq2 + 0.0) / sum(count2) freq2,
sum(count2 * sev2 + 0.0) / sum(count2) sev2
from (select * from RMS1 union select * from RMS2)
group by id", method = "raw")
giving:
id freq1 sev1 freq2 sev2
1 111 0.00 2.00 1.0 2.0
2 222 1.25 3.25 5.2 5.6
3 333 4.00 5.00 3.0 2.0
Note
The input in reproducible form is:
Lines1 <- "
id freq1 sev1 count1 freq2 sev2 count2
111 0 2 50 1 2 25
222 1 3 75 2 4 50"
RMS1 <- read.table(text = Lines1, header = TRUE)
Lines2 <- "
id freq1 sev1 count1 freq2 sev2 count2
222 2 4 25 6 6 200
333 4 5 60 3 2 20"
RMS2 <- read.table(text = Lines2, header = TRUE)75 + 25)