Rolling join grouped by a second variable in data.

2020-02-01 18:27发布

问题:

Hello I would like to perform a rolling join in R using the data.table package. There are multiple matches when joining on the "Date" column so I would like to use the by argument in data.table on the "Field" column to keep data from different fields from being joined together.

Example Data

d1<-structure(list(Field = c("6", "W62", "6", "6", "12S", "19-1", 
"6", "6", "12S", "7", "6", "12S", "W62", "6", "12S", "W62", "12S", 
"6", "6", "7", "12S", "12S", "W62", "7", "12S", "6", "12S", "7", 
"12S", "7", "6", "7", "12S", "7", "6", "6", "6", "6", "12S", 
"7", "7", "6", "6", "12S", "7", "12S", "12S", "12S", "19-1", 
"6"), Date = structure(c(16994, 17240, 17240, 17401, 17048, 17417, 
17387, 17394, 17382, 17414, 17029, 17403, 17045, 17359, 17179, 
17281, 17152, 16972, 16987, 17042, 17282, 17415, 17281, 17266, 
17179, 17190, 17057, 17380, 17280, 17178, 17178, 17343, 17373, 
17043, 17190, 17343, 17253, 16981, 17079, 17043, 17270, 17366, 
16981, 17357, 17366, 17415, 17079, 17190, 17385, 17008), class = "Date"), 
    NlbsAcre = c(NA, 18874.6557383659, 2477.08251404958, NA, 
    NA, 19658.0054165823, NA, NA, 12621.0827111083, NA, NA, 16764.41968227, 
    16764.9745173044, NA, 7671.24950330348, 21341.6444661863, 
    5197.26333885612, NA, NA, NA, 39560.8958554292, 18162.4040880297, 
    22578.1487456647, 15842.9161753361, 3613.95523726973, 2601.07083566694, 
    17766.9873538952, NA, 44728.1837479613, 2279.60909695434, 
    2014.7720270382, NA, 14847.7006686211, NA, 3082.31758038481, 
    NA, 2427.53558465175, NA, 23641.2999848709, NA, NA, NA, NA, 
    5928.31591997149, NA, 22162.2028819815, 18972.2228621189, 
    6534.4257935542, 12630.9231775315, NA)), .Names = c("Field", 
"Date", "NlbsAcre"), class = c("data.table", "data.frame"), row.names = c(NA, 
-50L), .internal.selfref = <pointer: 0x0000000006540788>)

d2<-structure(list(Field = c("6", "W62", "7", "12S", "19-1", "12S", 
"6", "6", "19-1", "19-1", "6", "7", "W62", "19-1", "12S", "7", 
"19-1", "7", "12S", "12S", "12S", "7", "6", "7", "6", "7", "W62", 
"19-1", "6", "6", "12S", "12S", "6", "6", "12S", "6", "12S", 
"19-1", "6", "W62", "W62", "6", "7", "7", "6", "19-1", "W62", 
"6", "12S", "7"), Date = structure(c(16993, 17140, 17208, 17443, 
17063, 16948, 17415, 16926, 17316, 16922, 16981, 17043, 17219, 
17252, 17392, 17244, 17179, 17017, 17042, 17031, 17013, 17104, 
17273, 16954, 17364, 16993, 17168, 17028, 17208, 16966, 17241, 
16945, 17038, 17169, 17379, 17183, 17238, 17054, 17244, 16952, 
17044, 17359, 17219, 17303, 17007, 17151, 16926, 17178, 17382, 
17364), class = "Date"), TotN = c(79.244802845739, 94.193700050628, 
21.075505564932, 692.152760834712, 224.689064446728, 172.576578578436, 
47.406177406404, 102.53239575903, 818.80997295717, 476.174916307807, 
125.828033450364, 58.270026966444, 75.465909993456, 435.049246131543, 
337.913876678769, 31.714327953234, 305.353940577156, 72.621457768224, 
393.815453005314, 428.540114240892, 318.97091713563, 73.888113736431, 
79.0380747113805, 147.493527174027, 65.5311189906495, 59.269732271703, 
119.390398108236, 110.706003557451, 21.96790939404, 149.060445984684, 
128.143343232486, 208.621943093862, 75.770138571561, 47.496596179338, 
132.723654607278, 43.92222198012, 145.150910469252, 215.88105225024, 
21.393670871196, 72.969536052, 86.335878117078, 103.524169592979, 
19.920230115264, 44.968722966108, 62.244487239885, 338.593490463303, 
96.7285416279375, 45.537296152302, 422.630318314444, 58.5336350807685
)), .Names = c("Field", "Date", "TotN"), class = c("data.table", 
"data.frame"), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000006540788>)

What I have tried

This is my attempt to perform a rolling join on the "Date" column, grouped by the "Field" column. Obviously I can split the data by "Field" and process separately but I would like to avoid that option.

d1[d2, roll = "nearest", on = .(Date), by = .(Field)]

Error in `[.data.table`(d1, d2, roll = "nearest", on = .(Date), by = .(Field)) : 
  'by' or 'keyby' is supplied but not j

回答1:

You were almost there.

You can join on multiple columns simultaneously. So, in addition to "Date", you can include "Field" in the on clause. But please note the description of the roll argument in ?data.table:

Rolling joins apply to the last join column

Thus, for "Date" to be used for the rolling join, specify it as the last variable in on:

library(data.table)
d1[d2, roll = "nearest", on = .(Field, Date)]

For better verification, the result can be ordered

d1[d2, roll = "nearest", on = .(Field, Date)][order(Field, Date)]
    Field       Date  NlbsAcre      TotN
 1:   12S 2016-05-24        NA 208.62194
 2:   12S 2016-05-27        NA 172.57658
 3:   12S 2016-07-31        NA 318.97092
 4:   12S 2016-08-18        NA 428.54011
 5:   12S 2016-08-29        NA 393.81545
 6:   12S 2017-03-13 44728.184 145.15091
 7:   12S 2017-03-16 44728.184 128.14334
 8:   12S 2017-08-01 12621.083 132.72365
 9:   12S 2017-08-04 12621.083 422.63032
10:   12S 2017-08-14 12621.083 337.91388
11:   12S 2017-10-04 22162.203 692.15276
12:  19-1 2016-05-01 12630.923 476.17492
13:  19-1 2016-08-15 12630.923 110.70600
14:  19-1 2016-09-10 12630.923 215.88105
15:  19-1 2016-09-19 12630.923 224.68906
16:  19-1 2016-12-16 12630.923 338.59349
17:  19-1 2017-01-13 12630.923 305.35394
18:  19-1 2017-03-27 12630.923 435.04925
19:  19-1 2017-05-30 12630.923 818.80997
20:     6 2016-05-05        NA 102.53240
21:     6 2016-06-14        NA 149.06045
22:     6 2016-06-29        NA 125.82803
23:     6 2016-06-29        NA 125.82803
24:     6 2016-07-11        NA  79.24480
25:     6 2016-07-25        NA  62.24449
26:     6 2016-08-25        NA  75.77014
27:     6 2017-01-03  2014.772  47.49660
28:     6 2017-01-12  2014.772  45.53730
29:     6 2017-01-17  2014.772  43.92222
30:     6 2017-02-11  3082.318  21.96791
31:     6 2017-03-19  2477.083  21.39367
32:     6 2017-04-17  2427.536  79.03807
33:     6 2017-07-12        NA 103.52417
34:     6 2017-07-17        NA  65.53112
35:     6 2017-09-06        NA  47.40618
36:     7 2016-06-02        NA 147.49353
37:     7 2016-07-11        NA  59.26973
38:     7 2016-08-04        NA  72.62146
39:     7 2016-08-30        NA  58.27003
40:     7 2016-08-30        NA  58.27003
41:     7 2016-10-30        NA  73.88811
42:     7 2017-02-11  2279.609  21.07551
43:     7 2017-02-22  2279.609  19.92023
44:     7 2017-03-19 15842.916  31.71433
45:     7 2017-05-17        NA  44.96872
46:     7 2017-07-17        NA  58.53364
47:   W62 2016-05-05 16764.975  96.72854
48:   W62 2016-05-31 16764.975  72.96954
49:   W62 2016-08-31 16764.975  86.33588
50:   W62 2016-12-05 16764.975  94.19370
51:   W62 2017-01-02 18874.656 119.39040
52:   W62 2017-02-22 18874.656  75.46591
    Field       Date  NlbsAcre      TotN