R Rolling Random Forest for Variables Selection [c

Posted 2019-08-05 12:11

Question:

I've got a daily OHLC dataset of the Euro Stoxx 50 index since 2008, which looks like this:

              Open    High     Low   Close Volume Adjusted
2008-01-02 4393.53 4411.59 4330.73 4339.23      0  4339.23
2008-01-03 4335.91 4344.36 4312.34 4333.42      0  4333.42
2008-01-04 4331.25 4343.46 4253.69 4270.53      0  4270.53
2008-01-07 4268.43 4294.45 4257.22 4283.37      0  4283.37
2008-01-08 4292.40 4330.56 4292.40 4295.23      0  4295.23
2008-01-09 4285.34 4285.34 4246.92 4258.32      0  4258.32

I've computed several technical rules using the TTR package. This gives me a bigger dataset that looks like this:

               RSI2     RSI3     RSI4     RSI5    RSI10    RSI20     SMA5    SMA20    SMA60     EMA5    EMA20    EMA60      atr      SMI
2009-01-07 97.964071 92.62210 87.21605 82.40040 66.95642 55.19221 19720.64 18655.29 17758.68 2556.777 2556.777 2556.777 82.06602 27.52145
2009-01-08 43.766573 58.62387 62.97794 64.03382 60.23197 52.99739 19756.44 18666.60 17754.07 2566.499 2566.499 2566.499 80.33416 29.12141
2009-01-09 27.182247 44.97072 52.29336 55.50633 56.74068 51.80171 19776.92 18674.31 17750.34 2523.372 2523.372 2523.372 78.65886 29.37878
2009-01-12 13.371347 30.46561 39.97055 45.24210 52.16207 50.17764 19788.02 18683.05 17748.76 2524.466 2524.466 2524.466 78.58966 28.17871
2009-01-13  6.141462 19.52298 29.30404 35.68593 47.25383 48.32987 19772.25 18693.01 17749.35 2488.165 2488.165 2488.165 76.08326 25.34705
2009-01-14  2.712386 11.97834 20.69541 27.26891 42.10718 46.23469 19747.87 18694.16 17742.88 2449.353 2449.353 2449.353 75.42231 20.65686

I would like to know, for each quarter of trading days, which technical rules are the most significant. I've decided to use the Random Forest-RI algorithm as implemented in the randomForest package, compute the Breiman importance measure (with the importance function), and select the technical rules whose variable importance is greater than the mean importance for that quarterly sample. Eventually, I would like to get the reduced dataset of technical rules over the whole period to compute statistics and so on.

Given that the number of significant technical rules can vary over time, the dimensions of the array that contains the most significant technical rules are not the same from one quarter to another. As a consequence, I can't put all my values in a single object.

Is there a convenient way to store all my quarterly datasets?

thanks.

Answer 1:

Use a data frame or an xts object. They both cope well with varying numbers of columns. In your case, as all your data columns are numeric type, you can use the xts object.

You said "rolling" in your title. Did you mean you want to analyze overlapping 90-day periods? E.g. 2008-01-02 to 2008-04-02, then 2008-01-03 to 2008-04-03, and so on? If so, rollapply(data,width=90,FUN) can be used. If you want to deal with quarters one at a time, use quarters <- split(data,'quarters') and then (as that gives you a list of xts objects) lapply(quarters,FUN).
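The quarter-by-quarter route can be sketched as follows. This is a minimal illustration, assuming the xts package is installed; the toy series, its column names and the per-quarter function (column means) are made up for the example and stand in for the real indicator table and FUN.

```r
library(xts)

# Toy daily series standing in for the indicator table (values are random).
set.seed(1)
dates <- seq(as.Date("2009-01-01"), as.Date("2009-12-31"), by = "day")
data <- xts(matrix(rnorm(length(dates) * 2), ncol = 2,
                   dimnames = list(NULL, c("RSI2", "SMA5"))),
            order.by = dates)

# One xts object per calendar quarter.
quarters <- split(data, f = "quarters")

# Apply a function to each quarter; column means are just a placeholder for FUN.
quarter_means <- lapply(quarters, function(xq) colMeans(coredata(xq)))

length(quarters)
```

Each element of quarters is itself an xts object, so anything that works on the full series (including a call to randomForest) can be done inside the function passed to lapply.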

I think your issue with using a single data structure was that SMA5 is available from 2008-01-08, but that SMA200 is not available until almost the end of the year; meaning that in the first three quarters the SMA200 column will contain nothing but NAs? This is fine. Keep the NAs and deal with them just before you pass the data to RandomForest.

In FUN you will remove the columns that contain NA like this (where xq is an xts object containing data for just one quarter):

xq = xq[, !apply(is.na(xq), 2, any)]
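As a small check of what that line does, here is a sketch on a plain numeric matrix (an assumption made to keep the example free of package dependencies; the same column indexing applies to an xts object). The values are invented, with SMA200 entirely NA as it would be early in the sample.

```r
# Toy quarter: three rows, one column that is all NA.
xq <- cbind(RSI2   = c(97.9, 43.8, 27.2),
            SMA5   = c(19720.6, 19756.4, 19776.9),
            SMA200 = c(NA, NA, NA))

# TRUE for every column that has no missing value.
keep <- !apply(is.na(xq), 2, any)

# drop = FALSE keeps the result two-dimensional even if one column survives.
xq_clean <- xq[, keep, drop = FALSE]

colnames(xq_clean)
# "RSI2" "SMA5"
```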

UPDATE: After re-reading your question and your follow-up question, I think the above answers a question you didn't have! I thought the issue was having NAs in your TTR table, and that RandomForest does not like them.

On reflection, I think your actual question was "The RandomForest gives me a varying number of good indicators from its analysis of each quarter, how do I deal with that?" The answer is a ragged data structure: a list, with one list entry per quarter. The list entry itself can be anything, even an xts object, but in this case a simple character vector of indicator names seems to be perfect. This is shown nicely in Zach's answer to your other question.
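A minimal sketch of that list-per-quarter idea, in base R only. The importance scores below are invented for illustration; in practice each vector would presumably come from something like importance(rf, type = 1)[, 1] on a forest fitted to that quarter. The helper select_above_mean and the quarter labels are hypothetical names, not part of any package.

```r
# Hypothetical helper: names of the indicators whose importance exceeds
# the mean importance of the quarter.
select_above_mean <- function(imp) {
  names(imp)[imp > mean(imp)]
}

# Invented importance scores, one named vector per quarter.
importance_by_quarter <- list(
  "2009 Q1" = c(RSI2 = 12.1, RSI5 = 3.4, SMA20 = 8.0, atr = 1.5),
  "2009 Q2" = c(RSI2 = 2.0,  RSI5 = 9.5, SMA20 = 2.5, atr = 2.0),
  "2009 Q3" = c(RSI2 = 7.0,  RSI5 = 6.0, SMA20 = 1.0, atr = 10.0)
)

# Ragged result: one character vector per quarter, lengths may differ.
selected <- lapply(importance_by_quarter, select_above_mean)

lengths(selected)
# 2009 Q1 2009 Q2 2009 Q3
#       2       1       2

# How often each indicator was selected over the whole period.
counts <- table(unlist(selected, use.names = FALSE))
```

Because every entry is just a character vector of column names, the reduced dataset for a quarter can be obtained by indexing that quarter's data with the corresponding entry, and summary statistics such as counts fall out of unlist and table.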