table with data (its a data.table object) that looks like the following :
date stock_id logret
1: 2011-01-01 1 0.001
2: 2011-01-02 1 0.003
3: 2011-01-03 1 0.005
4: 2011-01-04 1 0.007
5: 2011-01-05 1 0.009
6: 2011-01-06 1 0.011
7: 2011-01-01 2 0.013
8: 2011-01-02 2 0.015
9: 2011-01-03 2 0.017
10: 2011-01-04 2 0.019
11: 2011-01-05 2 0.021
12: 2011-01-06 2 0.023
13: 2011-01-01 3 0.025
14: 2011-01-02 3 0.027
15: 2011-01-03 3 0.029
16: 2011-01-04 3 0.031
17: 2011-01-05 3 0.033
18: 2011-01-06 3 0.035
The above can be created as :
DT = data.table(
date=rep(as.Date('2011-01-01')+0:5,3) ,
stock_id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
logret=seq(0.001, by=0.002, len=18));
setkeyv(DT,c('stock_id','date'))
Of course the real table is larger with many more stock_ids and dates. The aim to to reshape this data table such that I can run a regression of all stockid log_returns with their corresponding log_returns with a lag of 1 day (or prior traded day in case of weekends).
The final results would look like :
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
....
16: 2011-01-04 3 0.031 0.029
17: 2011-01-05 3 0.033 0.031
18: 2011-01-06 3 0.035 0.033
I'm finding this data structure really tricky to build without mixing up my stockid.
Update:
In the current development version of data.table, v1.9.5,
shift()
is implemented #965, which takes two types at the momenttype = "lag"
(default) andtype = "lead"
. See?shift
for more on usage.With this, we can simply do:
Thanks to Matthew Dowle's advice, I was able to use the following :
Results are :
Just some additional notes due to Alex's comment. The reason you have difficulties understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.
What do we actually want? We want a new column
lagret
and the syntax to add a new column in data.table is the following:where
xxx
has to be filled up with whatever you want to have in columnlagret
. So if we just want a new column that gives us the rows, we could just callHere, we actually want the lagged value of
logret
, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.tableDT
with itself by the columnsstock_id
anddate
, but since we want the previous value of each stock, we usedate-1
. Note that we have to set the keys first to do such a join:As you can see, we now have what we want.
logret
is now lagged by one period. But we actually want that in a new columnlagret
inDT
, so we just get that column by calling [[3L]] (this means nothing else then get me the third column) and name this new columnlagret
:This is already the correct solution. In this simple case, we do not need
roll=TRUE
because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in theDT
for the first stock:As you can see, the problem is now that we don't have a value for the 6th of January. That's why we use
roll=TRUE
:Just have a look on the documentation on how
roll=TRUE
works exactly. In a nutshell: If it can't find the previous value (herelogret
for the 5th of January), it just takes the last available one (here from the 3rd of January).