A sequential, cumulative calculation
I need to make a time-series calculation, where the value calculated in each row depends on the result calculated in the previous row. I am hoping to use the convenience of data.table
. The actual problem is a hydrological model -- a cumulative water balance calculation, adding rainfall at each time step and subtracting runoff and evaporation as a function of the current water volume. The dataset includes different basins and scenarios (groups). Here I will use a simpler illustration of the problem.
A simplified example of the calculation looks like this, for each time step (row) i
:
v[i] <- a[i] + b[i] * v[i-1]
a
and b
are vectors of parameter values, and v
is the result vector. For the first row (i == 1
) the initial value of v
is taken as v0 = 0
.
First attempt
My first thought was to use shift()
in data.table
. A minimal example, including the desired result v.ans
, is
library(data.table) # version 1.9.7
DT <- data.table(a = 1:4,
b = 0.1,
v.ans = c(1, 2.1, 3.21, 4.321) )
DT
# a b v.ans
# 1: 1 0.1 1.000
# 2: 2 0.1 2.100
# 3: 3 0.1 3.210
# 4: 4 0.1 4.321
DT[, v := NA] # initialize v
DT[, v := a + b * ifelse(is.na(shift(v)), 0, shift(v))][]
# a b v.ans v
# 1: 1 0.1 1.000 1
# 2: 2 0.1 2.100 2
# 3: 3 0.1 3.210 3
# 4: 4 0.1 4.321 4
This doesn't work, because shift(v)
gives a copy of the original column v
, shifted by 1 row. It is unaffected by assignment to v
.
I also considered building the equation using cumsum() and cumprod(), but that won't work either.
Brute force approach
So I resort to a for loop inside a function for convenience:
vcalc <- function(a, b, v0 = 0) {
v <- rep(NA, length(a)) # initialize v
for (i in 1:length(a)) {
v[i] <- a[i] + b[i] * ifelse(i==1, v0, v[i-1])
}
return(v)
}
This cumulative function works fine with data.table:
DT[, v := vcalc(a, b, 0)][]
# a b v.ans v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
identical(DT$v, DT$v.ans)
# [1] TRUE
My question
My question is, can I write this calculation in a more concise and efficient data.table
way, without having to use the for loop and/or function definition? Using set()
perhaps?
Or is there a better approach all together?
Edit: A better loop
David's Rcpp solution below inspired me to remove the ifelse()
from the for
loop:
vcalc2 <- function(a, b, v0 = 0) {
v <- rep(NA, length(a))
for (i in 1:length(a)) {
v0 <- v[i] <- a[i] + b[i] * v0
}
return(v)
}
vcalc2()
is 60% faster than vcalc()
.
It may not be 100% what you are looking for, as it does not use the "data.table-way" and still uses a for-loop. However, this approach should be faster (I assume you want to use data.table and the data.table-way to speed up your code). I leverage Rcpp to write a short function called HydroFun
, that can be used in R like any other function (you just need to source the function first). My gut-feeling tells me that the data.table way (if existent) is pretty complicated because you cannot compute a closed-form solution (but I may be wrong on this point...).
My approach looks like this:
The Rcpp function looks like this (in the file: hydrofun.cpp
):
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector HydroFun(NumericVector a, NumericVector b, double v0 = 0.0) {
// get the size of the vectors
int vecSize = a.length();
// initialize a numeric vector "v" (for the result)
NumericVector v(vecSize);
// compute v_0
v[0] = a[0] + b[0] * v0;
// loop through the vector and compute the new value
for (int i = 1; i < vecSize; ++i) {
v[i] = a[i] + b[i] * v[i - 1];
}
return v;
}
To source and use the function in R you can do:
Rcpp::sourceCpp("hydrofun.cpp")
library(data.table)
DT <- data.table(a = 1:4,
b = 0.1,
v.ans = c(1, 2.1, 3.21, 4.321))
DT[, v_ans2 := HydroFun(a, b, 0)]
DT
# a b v.ans v_ans2
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
Which gives the result you are looking for (at least from the value-perspective).
Comparing the speeds reveals a speed-up of roughly 65x.
library(microbenchmark)
n <- 10000
dt <- data.table(a = 1:n,
b = rnorm(n))
microbenchmark(dt[, v1 := vcalc(a, b, 0)],
dt[, v2 := HydroFun(a, b, 0)])
# Unit: microseconds
# expr min lq mean median uq max neval
# dt[, `:=`(v1, vcalc(a, b, 0))] 28369.672 30203.398 31883.9872 31651.566 32646.8780 68727.433 100
# dt[, `:=`(v2, HydroFun(a, b, 0))] 381.307 421.697 512.2957 512.717 560.8585 1496.297 100
identical(dt$v1, dt$v2)
# [1] TRUE
Does that help you in any way?
I think Reduce
together with accumulate = TRUE
is a commonly used technique for these types of calculations (see e.g. recursively using the output as an input for a function). It is not necessarily faster than a well-written loop*, and I don't know how data.table
-esque you believe it is, still I want to suggest it for your toolbox.
DT[ , v := 0][
, v := Reduce(f = function(v, i) a[i] + b[i] * v, x = .I[-1], init = a[1], accumulate = TRUE)]
DT
# a b v.ans v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
Explanation:
Set initial value of v to 0
(v := 0
). Use Reduce
to apply function f
on an integer vector of row numbers except the first row (x = .I[-1]
). Instead add a[1]
to the start of of x
(init = a[1]
).
Reduce
then "successively applies f to the elements [...] from left to right".
The successive reduce combinations are "accumulated" (accumulate = TRUE
).
*See e.g. here, where you also can read more about Reduce
in this section.