First some background info, which is probably more interesting on stats.stackexchange:
In my data analysis I try to compare the performance of different machine learning methods on time series data (regression, not classification). So for example I have trained a Boosting trained model and compare this with a Random Forest trained model (R package randomForest).
I use time series data where the explanatory variables are lagged values of other data and the dependent variable.
For some reason the Random Forest severely underperforms. One of the problems I could think of is that the Random Forest performs a sampling step of the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely removed.
To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so called block-wise bootstrap step. This basically means I cut the training set into k parts, where k<<N
, where each k-th part is in the original order. If I sample these k parts, I could still benefit from the 'randomness' in the Random Forest, but with the time series nature left largely intact.
Now my problem is this:
To achieve this I would normally copy the existing function and edit the desired step/lines.
randomForest2 <- randomForest()
But the randomForest() function seems to be a wrapper for another wrapper for deeper underlying functions. So how can I edit the actual bootstrap step in the randomForest() function and still run the rest of the function regularly?
To directly alter sampling of randomForest(type="reggression"): Learn basic C programming, download from cran source code randomForest.4.6-10.tar.gz, (if windows install Rtools), (if OSX install Xcode), install and open Rstudio, start new project, choose package, unpack ...tar.gz into folder, look into src folder, open regrf.c, checkout line 151 and 163. Write new sampling strategy, press occationally Ctrl+Shift+B package to rebuild/compile and overwrite randomForest library, correct stated compile errors, test occasionally if package still works, spend some hours figuring out the old uninformative code, perhaps change description file, namespace file, and some few other references so the package will change name to randomForestMod, rebuild, voilla.
A more easy way not changing the randomForest is described below. Any trees with the same feature inputs can be patched together with the function randomForest::combine, so you can design your sampling regime in pure R code. I thought it actually was a bad idea, but for this very naive simulation it actually works with similar/slightly better performance! Remember to not predict the absolute target value, but instead a stationary derivative such as relative change, absolute change etc. If predicting the absolute value, RF will fall back to predicting tomorrow is something pretty close of today. Which is a trivial useless information.
edited code [22:42 CEST]
So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the
split2
function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I looked for a package (rpart
) that performed just a single Regression Tree and aggregated it myself (taking the means).The result for my actual data is a slightly but consistently improved version over the normal random forest performance in terms of RMSPE.
For the code below the performance seems to be a coin-toss.
Taking Soren's code as an example it looks a bit like this: