I am performing a per policy life insurance valuation in R. Monthly cash flow projections are performed per policy and returns a data frame in the following format (for example):
Policy1 = data.frame(ProjM = 1:200,
Cashflow1 = rep(5,200),
Cashflow2 = rep(10,200))
My model returns a list (using lapply and a function which performs the per policy cashflow projection - based on various per policy details, escalation assumptions and life contingencies). I want to aggregate the cash flows across all policies by ProjM
. The following code does what I want, but looking for a more memory efficient way (ie not using the rbindlist
function). Example data:
Policy1 = data.frame(ProjM = 1:5,
Cashflow1 = rep(5,5),
Cashflow2 = rep(10,5))
Policy2 = data.frame(ProjM = 1:3,
Cashflow1 = rep(50,3),
Cashflow2 = rep(-45,3))
# this is the output containing 35000 data frames:
ListOfDataFrames = list(Policy1 = Policy1, Policy2 = Policy2)
My code:
library(data.table)
OneBigDataFrame <- rbindlist(ListOfDataFrames)
MyOutput <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)
Output required:
ProjM Cashflow1 Cashflow2
1 55 -35
2 55 -35
3 55 -35
4 5 10
5 5 10
I have looked for examples, and R aggregate list of dataframe performs aggregation for all data frames, but do not combine them into 1 data frame.
I think this solution might be efficient. Give it a try and let me know
With
data.table
syntax the one step approach would be to create the big data.table first and then do the aggregation:or, more concise
Now, the OP has requested to avoid creating the big data.table first in order to save memory. This requires a two step approach where the aggregates are computed for each data.table which are then aggregated to a grand total in the final step:
Note that
setDT()
is used here to coerce the data.frames to data.table by reference, i.e., without creating an additional copy which saves time and memory.Benchmark
Using the benchmark data of d.b (list of 10000 data.frames with 100 rows each, 28.5 Mb in total) with all answers provided so far:
The fastest solution is the one step approach using
data.table
which is 15 times faster than the second fastest. Surprisingly, the two stepdata.table
approaches are magnitudes slower than the one step approach.To make sure that all solutions return the same result this can be checked using
which return
TRUE
in all cases.