Combine set of conditions in data.table to extract

2019-07-13 11:58发布

问题:

Since my terrible execution and interpretation of my previous question I'll start over and will try to formulate the question as short and general possible.

I have two dataframes (see the examples below). Each dataset contains the same number of columns.

tc <- textConnection('
ID  Track1  Track2  Track3  Track4  Time    Loc
4   15      ""      ""      50      40      1   
5   17      115     109     55      50      1   
6   17      115     109     55      60      1   
7   13      195     150     60      70      1
8   13      195     150     60      80      1
9   ""      ""      181     70      90      2 #From this row, example data added
10  ""      ""      182     70      92      2
11  429     31      ""      80      95      3
12  480     31      12      80      96      3 
13  118     ""      ""      90      100     4
14  120     16      213     90      101     4   
')

MATCHINGS <- read.table(tc, header=TRUE)

tc <- textConnection('
ID  Track1  Track2  Track3  Track4  Time    Loc
""  15      ""      ""      50      40      1   
""  17      ""     109      55      50      1
""  17      432    109      55      65      1   
""  17      115     109     55      59      1       
""  13      195     150     60      68      1
""  13      195     150     60      62      1
""  10      5       1       10      61      3
""  13      195     150     60      72      1
""  40      ""      181     70      82      2 #From this row, example data added
""  ""      ""      182     70      85      2
""  429     ""      ""      80      90      3
""  ""      31      12      80      92      3
""  ""      ""      ""      90      95      4
""  118     16      213     90      96      4
')

INVOLVED <- read.table(tc, header=TRUE)

The goal is to place the least recent ID from MATCHINGS into INVOLVED by matching on Track1 to Track4 and Loc. An extra condition is that the Time of the matching INVOLVED entry may not be higher than the Time of the entry in MATCHING. Furthermore a match on Track1 is most preferred, a match on Track4 is least preferred. However only Track4 is always available (all other Track-columns can be empty). Thus the expected results are:

ID Track1 Track2 Track3 Track4 Time Loc
4     15     ""     ""     50   40   1
5     17     ""    109     55   50   1
""    17    432    109     55   65   1
6     17    115    109     55   59   1
7     13    195    150     60   68   1
7     13    195    150     60   62   1
""    10      5      1     10   61   3
8     13    195    150     60   72   1
9     40     ""    181     70   82   2 #From this row, example data added
10    ""     ""    182     70   85   2
11    429    ""     ""     80   90   3
12    ""     31     12     80   92   3
13    ""     ""     ""     90   95   4 
13    118    16    213     90   96   4

I tried to this with the data.table package, but fail in doing this efficient. Is it possible to get rid of the vector scans and efficiently go through the data without looping?

dat <- data.table(MATCHINGS)
for(i in 1:nrow(INVOLVED)){
    row <- INVOLVED[i,]
    match <- dat[Time>=row$Time][Loc==row$Loc][Track4==row$Track4][Track4!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track3==row$Track3][Track3!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track2==row$Track2][Track2!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track1==row$Track1][Track1!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
}

update

Updated the example data showing the need for Track 1 to 3. As shown Track1 is most important and Track4 least important. Even if Track1 to 3 match to MATCHINGS x and Track4 matches to MATCHINGS y, the ID of y should be assigned to that INVOLVED row. So: Track3 match overrides Track4 match, Track2 match overrides Track3 match, Track1 match overrides Track2 match.

回答1:

With roll argument able to also roll next observation backward along with the new (v1.9.6+) on= argument, we can do this much more straightforward:

require(data.table)
setDT(MATCHINGS)
setDT(INVOLVED)
INVOLVED[ , ID := MATCHINGS[INVOLVED, ID, roll=-Inf, 
                    mult="first", on=c("Loc", "Track4", "Time")]]]

That's it.


Here's a data.table-ish start. This only uses Track 4 (not 1 to 3) but it still appears to produce the requested output.

M = as.data.table(MATCHINGS)
I = as.data.table(INVOLVED)
M[,Time:=-Time]
I[,Time:=-Time]
setkey(M,Loc,Track4,Time)
I[,ID:={i=list(Loc,Track4,Time);M[i,ID,roll=TRUE,mult="first"]}][,Time:=-Time]

    ID Track1 Track2 Track3 Track4 Time Loc
 1:  1     NA    105     NA     35    1   1
 2:  1     NA     NA     NA     35    2   1
 3:  1     26    105     NA     35    3   1
 4:  2     NA     NA     NA     40   20   1
 5:  2    134      1      6     40   20   1
 6:  3     13    109     NA     45   30   1
 7:  4     15     NA     NA     50   40   1
 8:  5     17     NA    109     55   50   1
 9: NA     17    432    109     55   65   1
10:  6     17    115    109     55   59   1
11:  7     13    195    150     60   68   1
12:  7     13    195    150     60   62   1
13: NA     10      5      1     10   61   3
14:  8     13    195    150     60   72   1

Interesting question! If this seems ok, please change the example data to need tracks 1 to 3. Or perhaps you can take it from here.