conditional data.table match for subset of data.ta

2019-07-22 07:27发布

问题:

This post is related to the previous post here: match rows of two data.tables to fill subset of a data.table

Not sure how I can integrate them together. I have a situation where other than the NA for one column of DT1, a couple of more conditions should apply for merging, but that doesn't work.

> DT1 <- data.table(colA = c(1,1, 2,2,2,3,3), colB = c('A', NA, 'AA', 'B', NA, 'A', 'C'), timeA = c(2,4,3,4,6,1,4))
> DT1
   colA colB timeA
1:    1    A     2
2:    1 <NA>     4
3:    2   AA     3
4:    2    B     4
5:    2 <NA>     6
6:    3    A     1
7:    3    C     4
> DT2 <- data.table(colC = c(1,1,1,2,2,3), timeB1 = c(1,3,6, 2,4, 1), timeB2 = c(2,5,7,3,5,4), colD = c('Z', 'YY', 'AB', 'JJ', 'F', 'RR'))
> DT2
   colC timeB1 timeB2 colD
1:    1      1      2    Z
2:    1      3      5   YY
3:    1      6      7   AB
4:    2      2      3   JJ
5:    2      4      5    F
6:    3      1      4   RR

Using the same guideline as mentioned above, I'd like to merge ColD of DT2 to colB of DT1 only for NA values of colB in DT1 AND use the values of colD for which timeA in DT1 is between timeB1 and timeB2 in DT2. I tried the following but merge doesn't happen:

 > output <- DT1[DT2, on = .(colA = colC), colB := ifelse(is.na(x.colB) & i.timeB1 <= x.timeA & x.timeA <= i.timeB2, i.colD, x.colB)]
> output
> output
   colA colB timeA
1:    1    A     2
2:    1 <NA>     4
3:    2   AA     3
4:    2    B     4
5:    2 <NA>     6
6:    3    A     1
7:    3    C     4

Nothing changes in output. these is my desired output:

> desired_output
   colA colB timeA
1:    1    A     2
2:    1   YY     4   --> should find a match
3:    2   AA     3
4:    2    B     4
5:    2 <NA>     6   --> shouldn't find a match
6:    3    A     1
7:    3    C     4

why doesn't this work? I'd like to use data.table operations only without using additional packages.

回答1:

An in place update of the colB in DT1 would work as follows:

DT1[is.na(colB), colB := DT2[DT1[is.na(colB)], 
                    on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]]
print(DT1)
   colA colB timeA
1:    1    A     2
2:    1   YY     4
3:    2   AA     3
4:    2    B     4
5:    2 <NA>     6
6:    3    A     1
7:    3    C     4

This indexes the values where colB is NA and after a join on the condition, as defined in on= ..., replaces the missing values by the matching values found in colD.



回答2:

Possibly not the sortest answer, but it gets the job done.. I'm no data.table-expert, so I welcome improvements/suggestions.

DT1[ is.na(colB), colB := DT1[ is.na(colB), ][ DT2, colB := i.colD, on = c( "colA == colC", "timeA >= timeB1", "timeA <= timeB2")]$colB]

what is does:
first, subset DT1 for all rows where is.na(colB) = TRUE
then, update the value of colB in these rows with the colB-vector from the result of a non-equi join of the same subset of rows on DT2

Bonus is that DT1 is chaged by reference, so it's pretty fast and memory efficient on large data (I think).

   colA colB timeA
1:    1    A     2
2:    1   YY     4
3:    2   AA     3
4:    2    B     4
5:    2 <NA>     6
6:    3    A     1
7:    3    C     4