table with data (its a data.table object) that looks like the following :
date stock_id logret
1: 2011-01-01 1 0.001
2: 2011-01-02 1 0.003
3: 2011-01-03 1 0.005
4: 2011-01-04 1 0.007
5: 2011-01-05 1 0.009
6: 2011-01-06 1 0.011
7: 2011-01-01 2 0.013
8: 2011-01-02 2 0.015
9: 2011-01-03 2 0.017
10: 2011-01-04 2 0.019
11: 2011-01-05 2 0.021
12: 2011-01-06 2 0.023
13: 2011-01-01 3 0.025
14: 2011-01-02 3 0.027
15: 2011-01-03 3 0.029
16: 2011-01-04 3 0.031
17: 2011-01-05 3 0.033
18: 2011-01-06 3 0.035
The above can be created as :
DT = data.table(
date=rep(as.Date('2011-01-01')+0:5,3) ,
stock_id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
logret=seq(0.001, by=0.002, len=18));
setkeyv(DT,c('stock_id','date'))
Of course the real table is larger with many more stock_ids and dates. The aim to to reshape this data table such that I can run a regression of all stockid log_returns with their corresponding log_returns with a lag of 1 day (or prior traded day in case of weekends).
The final results would look like :
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
....
16: 2011-01-04 3 0.031 0.029
17: 2011-01-05 3 0.033 0.031
18: 2011-01-06 3 0.035 0.033
I'm finding this data structure really tricky to build without mixing up my stockid.
Just some additional notes due to Alex's comment. The reason you have difficulties understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.
What do we actually want? We want a new column lagret
and the syntax to add a new column in data.table is the following:
DT[, lagret := xxx]
where xxx
has to be filled up with whatever you want to have in column lagret
. So if we just want a new column that gives us the rows, we could just call
DT[, lagret := seq(from=1, to=nrow(DT))]
Here, we actually want the lagged value of logret
, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.table DT
with itself by the columns stock_id
and date
, but since we want the previous value of each stock, we use date-1
. Note that we have to set the keys first to do such a join:
setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
stock_id date logret
1: 1 2010-12-31 NA
2: 1 2011-01-01 0.001
3: 1 2011-01-02 0.003
4: 1 2011-01-03 0.005
5: 1 2011-01-04 0.007
6: 1 2011-01-05 0.009
...
As you can see, we now have what we want. logret
is now lagged by one period. But we actually want that in a new column lagret
in DT
, so we just get that column by calling [[3L]] (this means nothing else then get me the third column) and name this new column lagret
:
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
...
This is already the correct solution. In this simple case, we do not need roll=TRUE
because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT
for the first stock:
DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 NA
5: 2011-01-01 2 0.013 NA
...
As you can see, the problem is now that we don't have a value for the 6th of January. That's why we use roll=TRUE
:
DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 0.005
5: 2011-01-01 2 0.013 NA
...
Just have a look on the documentation on how roll=TRUE
works exactly. In a nutshell: If it can't find the previous value (here logret
for the 5th of January), it just takes the last available one (here from the 3rd of January).
Update:
In the current development version of data.table, v1.9.5, shift()
is implemented #965, which takes two types at the moment type = "lag"
(default) and type = "lead"
. See ?shift
for more on usage.
With this, we can simply do:
# type="lag" may be omitted, as it is the default.
require(data.table) ## 1.9.5+
DT[, lagret := shift(logret, 1L, type="lag"), by=stock_id]
# date stock_id logret lagret
# 1: 2011-01-01 1 0.001 NA
# 2: 2011-01-02 1 0.003 0.001
# 3: 2011-01-03 1 0.005 0.003
# 4: 2011-01-04 1 0.007 0.005
# 5: 2011-01-05 1 0.009 0.007
# 6: 2011-01-06 1 0.011 0.009
# 7: 2011-01-01 2 0.013 NA
# 8: 2011-01-02 2 0.015 0.013
# 9: 2011-01-03 2 0.017 0.015
# 10: 2011-01-04 2 0.019 0.017
# 11: 2011-01-05 2 0.021 0.019
# 12: 2011-01-06 2 0.023 0.021
# 13: 2011-01-01 3 0.025 NA
# 14: 2011-01-02 3 0.027 0.025
# 15: 2011-01-03 3 0.029 0.027
# 16: 2011-01-04 3 0.031 0.029
# 17: 2011-01-05 3 0.033 0.031
# 18: 2011-01-06 3 0.035 0.033
Thanks to Matthew Dowle's advice, I was able to use the following :
DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
Results are :
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
6: 2011-01-06 1 0.011 0.009
7: 2011-01-01 2 0.013 NA
8: 2011-01-02 2 0.015 0.013
9: 2011-01-03 2 0.017 0.015
10: 2011-01-04 2 0.019 0.017
11: 2011-01-05 2 0.021 0.019
12: 2011-01-06 2 0.023 0.021
13: 2011-01-01 3 0.025 NA
14: 2011-01-02 3 0.027 0.025
15: 2011-01-03 3 0.029 0.027
16: 2011-01-04 3 0.031 0.029
17: 2011-01-05 3 0.033 0.031
18: 2011-01-06 3 0.035 0.033