EDIT: Reduced the size of the dataset
Sample data:
df <- data.frame(loc.id = rep(1:10, each = 80*36),
year = rep(rep(1980:2015, each = 80), times = 10),
day = rep(rep(1:80, times = 36),times = 10),
rain = runif(10*36*80, min = 0 , max = 5),
swc = runif(10*36*80,min = 0, max = 50),
SW.max = rep(runif(10, min = 100, max = 200), each = 80*36),
SW.ini = runif(10*36*80),
PETc = runif(10*36*80, min = 0 , max = 1.3),
SW = NA,
PAW = NA,
aetc = NA)
df contains daily data (80 days per year) for 1980-2015 for 10 locations.
For each location x year combination, I want to do the following calculation:
list.result <- list() # create a list to store all results
ptm <- proc.time()
n <- 0
for(i in seq_along(unique(df$loc.id))){
location <- unique(df$loc.id)[i]
print(location)
for(j in seq_along(unique(df$year))){
yr <- unique(df$year)[j]
print(yr)
df_year <- df[df$loc.id == location & df$year == yr,] # subset data for location i and year y
# for the first row of data frame, i need to calculate some values
SW.ini <- df_year$SW.ini[1]
SW.max <- df_year$SW.max[1]
df_year$PAW[1] <- SW.ini + df_year$rain[1]
df_year$aetc[1] <- ifelse(df_year$PAW[1] >= df_year$swc[1],
df_year$PETc[1],(df_year$PAW[1]/df_year$swc[1])*df_year$PETc[1])
df_year$aetc[1] <- ifelse(df_year$aetc[1] > df_year$PAW[1], df_year$PAW[1], df_year$aetc[1])
df_year$SW[1] <- SW.ini + df_year$rain[1] - df_year$aetc[1]
df_year$SW[1] <- ifelse(df_year$SW[1] > SW.max, SW.max, ifelse(df_year$SW[1] < 0, 0,df_year$SW[1]))
# for row 2 till row n of df_year, I need to do this:
for (day in 2:nrow(df_year)){
df_year$PAW[day] <- df_year$SW[day - 1] + df_year$rain[day]
df_year$aetc[day] <- ifelse(df_year$PAW[day] >= df_year$swc[day], df_year$PETc[day], (df_year$PAW[day]/df_year$swc[day]) * df_year$PETc[day])
df_year$aetc[day] <- ifelse(df_year$aetc[day] > df_year$PAW[day], df_year$PAW[day],df_year$aetc[day])
df_year$SW[day] <- df_year$SW[day - 1] + df_year$rain[day] - df_year$aetc[day]
df_year$SW[day] <- ifelse(df_year$SW[day] > SW.max,SW.max, ifelse(df_year$SW[day] < 0, 0,df_year$SW[day]))
}
n <- n + 1
list.result[[n]] <- df_year
}}
proc.time() - ptm
user system elapsed
8.64 0.00 8.75
final.dat <- data.table::rbindlist(list.result) # rbindlist() comes from the data.table package
This loop runs sequentially, and I thought it would be a good candidate for foreach in R. I have not really worked with foreach, so some online research brought me to this:
library(doParallel)
cl <- makeCluster(4) # if I understood this correctly, this sets the number of cores (worker processes) to be used
registerDoParallel(cl)
foreach(i = seq_along(unique(df$loc.id))) %dopar% {
list.result <- list()
for(j in seq_along(1980:2015)){
df_year <- df[df$loc.id == unique(df$loc.id)[i] & df$year == unique(df$year)[j],] # subset data for location i and year y
# for the first row of data frame, i need to calculate some values
SW.ini <- df_year$SW.ini[1]
SW.max <- df_year$SW.max[1]
df_year$PAW[1] <- SW.ini + df_year$rain[1]
df_year$aetc[1] <- ifelse(df_year$PAW[1] >= df_year$swc[1], df_year$PETc[1],(df_year$PAW[1]/df_year$swc[1])*df_year$PETc[1])
df_year$aetc[1] <- ifelse(df_year$aetc[1] > df_year$PAW[1], df_year$PAW[1], df_year$aetc[1])
df_year$SW[1] <- SW.ini + df_year$rain[1] - df_year$aetc[1]
df_year$SW[1] <- ifelse(df_year$SW[1] > SW.max, SW.max, ifelse(df_year$SW[1] < 0, 0,df_year$SW[1]))
# for row 2 till row n of df_year, I need to do this:
for (day in 2:nrow(df_year)){
df_year$PAW[day] <- df_year$SW[day - 1] + df_year$rain[day]
df_year$aetc[day] <- ifelse(df_year$PAW[day] >= df_year$swc[day], df_year$PETc[day], (df_year$PAW[day]/df_year$swc[day]) * df_year$PETc[day])
df_year$aetc[day] <- ifelse(df_year$aetc[day] > df_year$PAW[day], df_year$PAW[day],df_year$aetc[day])
df_year$SW[day] <- df_year$SW[day - 1] + df_year$rain[day] - df_year$aetc[day]
df_year$SW[day] <- ifelse(df_year$SW[day] > SW.max,SW.max, ifelse(df_year$SW[day] < 0, 0,df_year$SW[day]))
}
list.result[[j]] <- df_year
}
dat <- rbindlist(list.result)
fwrite(dat,paste0(i,"dat.csv"))
}
My questions are:
1) Is the above data a good candidate for foreach?
2) There is a for-loop within the foreach. Does that make sense?
3) How do I make the above foreach run and return all the results?
This code replaces the inner loop and is about 60x faster than the implementation in the original question. Note that this is the approach taken in C++, i.e., allocate and update new vectors rather than existing parts of the data.frame; this is a big part of the performance difference, and the benefit can be obtained WITHOUT Rcpp.
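As a rough illustration of the "allocate and update new vectors" idea, here is a minimal base-R sketch. The name fill1() comes from this answer, but its argument names and return shape are assumptions made for illustration, not the original implementation. It handles one location-year at a time and carries the previous day's soil water in a scalar instead of writing into the data.frame on every step.

```r
# Hypothetical sketch: process one location-year using plain vectors only.
# Argument names and the returned list are assumptions, not the original code.
fill1 <- function(rain, swc, PETc, SW.ini, SW.max) {
  n <- length(rain)
  PAW <- aetc <- SW <- numeric(n)    # allocate the result vectors once
  sw_prev <- SW.ini                  # soil water carried over from the previous day
  for (day in seq_len(n)) {
    paw <- sw_prev + rain[day]
    # full PETc when water is plentiful, otherwise scaled by PAW / swc
    a <- if (paw >= swc[day]) PETc[day] else (paw / swc[day]) * PETc[day]
    a <- min(a, paw)                 # cannot lose more water than is available
    sw_prev <- min(max(paw - a, 0), SW.max)  # clamp soil water to [0, SW.max]
    PAW[day] <- paw
    aetc[day] <- a
    SW[day] <- sw_prev
  }
  list(SW = SW, PAW = PAW, aetc = aetc)
}

# Usage on one simulated location-year (values are made up)
set.seed(1)
res <- fill1(rain = runif(80, 0, 5), swc = runif(80, 0, 50),
             PETc = runif(80, 0, 1.3), SW.ini = 0.5, SW.max = 150)
str(res)
```

Because day 1 uses SW.ini as the "previous" soil water, the first row no longer needs separate code.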
This is a generalization (very light testing!) to iterate on a location.year x day matrix with inputs from the original, assuming the input is sorted by year, location, and day.
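A sketch of how such a matrix version might look follows. The function name fill_all() and its n.day argument are hypothetical. It assumes only that every location-year occupies n.day consecutive rows in day order. Each input column is reshaped to a location.year x day matrix, and the loop then runs over days only, updating all location-years at once with vectorized operations.

```r
# Hypothetical sketch: all location-years advance together, one day per iteration.
# Assumes each location-year occupies n.day consecutive rows of df, in day order.
fill_all <- function(df, n.day = 80) {
  stopifnot(nrow(df) %% n.day == 0)
  # rows: location-years, columns: days
  as_mat <- function(x) matrix(x, ncol = n.day, byrow = TRUE)
  rain <- as_mat(df$rain)
  swc <- as_mat(df$swc)
  PETc <- as_mat(df$PETc)
  SW.max <- as_mat(df$SW.max)[, 1]    # one capacity per location-year
  sw_prev <- as_mat(df$SW.ini)[, 1]   # initial soil water per location-year
  PAW <- aetc <- SW <- matrix(0, nrow(rain), n.day)
  for (day in seq_len(n.day)) {
    paw <- sw_prev + rain[, day]
    a <- ifelse(paw >= swc[, day], PETc[, day], (paw / swc[, day]) * PETc[, day])
    a <- pmin(a, paw)
    sw_prev <- pmin(pmax(paw - a, 0), SW.max)
    PAW[, day] <- paw
    aetc[, day] <- a
    SW[, day] <- sw_prev
  }
  flat <- function(m) as.vector(t(m))  # back to the row order of df
  df$PAW <- flat(PAW)
  df$aetc <- flat(aetc)
  df$SW <- flat(SW)
  df
}

# Usage on a small made-up data set: 2 locations x 2 years x 5 days
set.seed(1)
small <- data.frame(loc.id = rep(1:2, each = 2 * 5),
                    year = rep(rep(2001:2002, each = 5), times = 2),
                    day = rep(1:5, times = 4),
                    rain = runif(20, 0, 5),
                    swc = runif(20, 0, 50),
                    SW.max = rep(c(120, 150), each = 10),
                    SW.ini = runif(20),
                    PETc = runif(20, 0, 1.3))
out <- fill_all(small, n.day = 5)
head(out)
```

The loop now has n.day iterations regardless of how many location-years there are, which is where most of the additional speed-up would come from.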
It is about 15x faster than fill1() on a per-location.date basis, for the subset of data in the question. The operation on the sample data takes about 10 milliseconds, and about 10 seconds for the full data. That is 5x slower than Matt's C++ solution, but still a very substantial improvement over the original, and it employs basic R techniques that will improve code in many different areas.

To address your three questions:

Using Rcpp and data.table instead

Compiling the logic with C++ and applying it by group using data.table grouping operations gives a ~2,000x speed-up from your baseline, far greater than you might hope to get by parallelizing.
On your original example, which had 39,420,000 rows, this executes on my machine in 1.883 seconds; and on the revised one with 28,800 rows, this executes in 0.004 seconds.
Define and compile a C++ function, CalcSW(), inline in the R script:
One note: counting in C/C++ starts at 0, unlike R, which starts at 1; that's why the indices are different here.
Create data.table
Execute the function CalcSW() on the df for each combination of loc.id and year, and assign the returned values to the three columns simultaneously:
...
Results:
...
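The grouped assignment described above can be sketched as follows. To keep the example runnable without a C++ compiler, it uses a hypothetical pure-R stand-in, calc_sw_r(), in place of the compiled CalcSW(); the stand-in's name, its argument order, and the small data set are all assumptions. The point is the data.table pattern: a function that returns a list of three equal-length vectors, assigned to three columns at once with `:=` and `by`.

```r
library(data.table)

# Hypothetical pure-R stand-in for the compiled CalcSW();
# returns a list of three vectors, each the same length as the group.
calc_sw_r <- function(SW.ini, SW.max, rain, swc, PETc) {
  n <- length(rain)
  SW <- PAW <- aetc <- numeric(n)
  sw_prev <- SW.ini[1]               # only the first value of the group is used
  for (i in seq_len(n)) {
    PAW[i] <- sw_prev + rain[i]
    aetc[i] <- if (PAW[i] >= swc[i]) PETc[i] else (PAW[i] / swc[i]) * PETc[i]
    aetc[i] <- min(aetc[i], PAW[i])
    SW[i] <- min(max(PAW[i] - aetc[i], 0), SW.max[1])
    sw_prev <- SW[i]
  }
  list(SW, PAW, aetc)
}

# Small made-up data set: 2 locations x 3 years x 5 days
set.seed(1)
dt <- data.table(loc.id = rep(1:2, each = 3 * 5),
                 year = rep(rep(2001:2003, each = 5), times = 2),
                 day = rep(1:5, times = 6),
                 rain = runif(30, 0, 5),
                 swc = runif(30, 0, 50),
                 SW.max = rep(c(120, 150), each = 15),
                 SW.ini = runif(30),
                 PETc = runif(30, 0, 1.3))

# One call per (loc.id, year) group; the three returned vectors fill three columns
dt[, c("SW", "PAW", "aetc") := calc_sw_r(SW.ini, SW.max, rain, swc, PETc),
   by = .(loc.id, year)]
head(dt)
```

Swapping calc_sw_r() for a compiled function with the same inputs and outputs should leave the grouping line unchanged.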
I'm not 100% positive I implemented your logic perfectly, but it should be pretty straightforward to tweak wherever I may have missed something; I implemented it in a very similar manner to how you laid it out.
One other note: It's way easier to write C++ with the auto-indenting and code highlighting (whether you're using RStudio or Emacs) you get if you create a separate file, named something like TestCode.cpp, formatted like below.
Then, you can either use Rcpp::sourceCpp("TestCode.cpp") to compile your function in your R script, or you can copy and paste everything except for the first three lines as a character string into an argument of Rcpp::cppFunction(), like I did above.