Consider the following data.tables. The first defines a set of regions with start and end positions for each group:
library(data.table)
d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
setkey(d1, x,start)
# x start end
# 1: a 1 3
# 2: b 5 11
# 3: c 19 22
# 4: d 30 39
# 5: e 7 25
The second represents observations for each group
d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d2, x,pos)
# x pos
# 1: a 2
# 2: a 3
# 3: b 3
# 4: b 12
# 5: c 20
# 6: d 52
# 7: e 10
Ultimately I'd like to be able to extract the rows in d2 that fall within a region for the matching x value in d1. The desired result is:
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
The start/end positions for any group x
will never overlap but there may be gaps of values not in any region.
Now, I believe I should be using a rolling join. From what I can tell, I cannot use the "end" column in the join.
I've tried
d1[d2, roll=T, nomatch=0, mult="all"][start<=end]
and got
# x start end
# 1: a 2 3
# 2: a 3 3
# 3: c 20 22
# 4: e 10 25
which is the right set of rows I want. However, "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the rolling join so I could report "start", "pos", and "end" as desired?
Overlap joins were implemented with commit 1375 in data.table v1.9.3, and are available in the current stable release, v1.9.4. The function is called `foverlaps`. From NEWS:

Let's consider x, an interval defined as `[a, b]`, where `a <= b`, and y, another interval defined as `[c, d]`, where `c <= d`. The interval y is said to overlap x at all iff `d >= a` and `c <= b`. And y is entirely contained within x iff `a <= c` and `d <= b`. For the different types of overlaps implemented, please have a look at `?foverlaps`.
Your question is a special case of an overlap join: in `d1` you have true physical intervals with `start` and `end` positions. In `d2`, on the other hand, there are only positions (`pos`), not intervals. To be able to do an overlap join, we need to create intervals in `d2` as well. This is achieved by adding an additional variable `pos2`, which is identical to `pos` (`d2[, pos2 := pos]`). Thus, we now have an interval in `d2`, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in `d2` can then be used in `foverlaps` to do an overlap join with `d1`.

`by.y` by default is `key(y)`, so we skipped it. `by.x` by default takes `key(x)` if it exists, and if not takes `key(y)`. But a key doesn't exist for `d2`, and we can't take the columns from `y`, because they don't have the same names. So, we set `by.x` explicitly. The type of overlap is within, and we'd like to have all matches, but only if there is a match.
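Putting this together on the question's data, the call described above might look like the following sketch (the `by.x` vector and arguments follow the explanation; `d1` is keyed so that `key(y)` ends with its interval columns):

```r
library(data.table)  # v1.9.4+
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))

setkey(d1, x, start, end)   # by.y defaults to key(y); its last two columns must be the interval
d2[, pos2 := pos]           # the virtual, zero-width interval [pos, pos2]

foverlaps(d2, d1, by.x = c("x", "pos", "pos2"),
          type = "within", mult = "all", nomatch = 0L)
#    x start end pos pos2
# 1: a     1   3   2    2
# 2: a     1   3   3    3
# 3: c    19  22  20   20
# 4: e     7  25  10   10
```

Dropping `pos2` afterwards (`ans[, pos2 := NULL]`) leaves exactly the "start", "pos", "end" columns asked for.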
NB: `foverlaps` uses data.table's binary search feature (along with `roll` where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc.) are inspired by the function `findOverlaps()` from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for genomics).

So what's the advantage?
A benchmark on the code above on your data shows `foverlaps()` slower than Gabor's answer (timings: Gabor's data.table solution = 0.004 vs. foverlaps = 0.021 seconds). But does it really matter at this granularity?

What would be really interesting is to see how well it scales, in terms of both speed and memory. In Gabor's answer, we join based on the key column `x` and then filter the results. What if `d1` has about 40K rows and `d2` has 100K rows (or more)? For each row in `d2` that matches an `x` in `d1`, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly.

Generate data:
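The generation code itself isn't shown above; here is a sketch under the stated sizes (~40K regions, 100K positions), where the seed, group count, and value ranges are illustrative assumptions:

```r
library(data.table)
set.seed(1L)            # seed, group count, and value ranges are assumptions
n1 <- 40e3L             # regions in d1
n2 <- 100e3L            # positions in d2
grp <- sprintf("g%03d", 1:500)

s  <- sample(1e6L, n1, replace = TRUE)
d1 <- data.table(x = sample(grp, n1, TRUE),
                 start = s,
                 end   = s + sample(100L, n1, TRUE))  # short regions, as in the Q
d2 <- data.table(x = sample(grp, n2, TRUE),
                 pos = sample(1e6L, n2, TRUE))
```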
foverlaps:
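A sketch of the timed call (`ans1` is the object the memory figure below refers to; the assumed data generation is repeated so the snippet runs on its own):

```r
library(data.table)
set.seed(1L)   # same assumed data as in the generation step
s  <- sample(1e6L, 40e3L, TRUE)
d1 <- data.table(x = sample(sprintf("g%03d", 1:500), 40e3L, TRUE),
                 start = s, end = s + sample(100L, 40e3L, TRUE))
d2 <- data.table(x = sample(sprintf("g%03d", 1:500), 100e3L, TRUE),
                 pos = sample(1e6L, 100e3L, TRUE))

setkey(d1, x, start, end)
d2[, pos2 := pos]
system.time(
  ans1 <- foverlaps(d2, d1, by.x = c("x", "pos", "pos2"),
                    type = "within", mult = "all", nomatch = 0L)
)
```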
This took ~1GB of memory in total, out of which `ans1` is 420MB. Most of the time spent here is on the subset, really. You can check this by setting the argument `verbose=TRUE`.

Gabor's solutions:
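A sketch of the join-then-filter approach on the same assumed data — the large intermediate `d1[d2]` is what drives the memory usage:

```r
library(data.table)
set.seed(1L)   # same assumed data as above, repeated so this runs alone
s  <- sample(1e6L, 40e3L, TRUE)
d1 <- data.table(x = sample(sprintf("g%03d", 1:500), 40e3L, TRUE),
                 start = s, end = s + sample(100L, 40e3L, TRUE))
d2 <- data.table(x = sample(sprintf("g%03d", 1:500), 100e3L, TRUE),
                 pos = sample(1e6L, 100e3L, TRUE))

setkey(d1, x)
setkey(d2, x)
system.time(
  # every d1/d2 row pair sharing an x is materialised, then filtered
  ans2 <- d1[d2, allow.cartesian = TRUE][between(pos, start, end)]
)
```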
And this took a total of ~3.5GB.
I just noted that Gabor already mentions the memory required for intermediate results. So, trying out `sqldf`: that took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.
[The answers were verified to be identical after removing `pos2` from `ans1` and setting the key on both answers.]

Note that this overlap join is designed for problems where `d2` doesn't necessarily have identical start and end coordinates (e.g. genomics, the field where I come from, where `d2` is usually about 30-150 million rows or more). `foverlaps()` is stable, but is still under development, meaning some arguments and names might get changed.

NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood and is quite memory efficient as well. In my benchmarks on genomics data, `foverlaps()` is faster. But that's for another (blog) post, some other time.
is faster. But that's for another (blog) post, some other time.1) sqldf This is not data.table but complex join criteria are easy to specify in a straight forward manner in SQL:
giving:
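The SQL itself isn't shown above; a sketch of one way to phrase it (the `order by` is added here only to make the output order deterministic, and `end` is bracket-quoted because it is an SQL keyword):

```r
library(sqldf)

d1 <- data.frame(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.frame(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))

sqldf("select d2.x, d2.pos, d1.start, d1.[end]
         from d2
         join d1 on d2.x = d1.x
              and d2.pos between d1.start and d1.[end]
        order by d2.x, d2.pos")
#   x pos start end
# 1 a   2     1   3
# 2 a   3     1   3
# 3 c  20    19  22
# 4 e  10     7  25
```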
2) data.table For a data.table answer try this:
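A sketch of the join-then-filter idea (keyed on `x`, with data.table's `between` doing the range check):

```r
library(data.table)
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d1, x)
setkey(d2, x)

d1[d2][between(pos, start, end)]   # join on x, then range-filter
#    x start end pos
# 1: a     1   3   2
# 2: a     1   3   3
# 3: c    19  22  20
# 4: e     7  25  10
```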
Note that this does have the disadvantage of forming the possibly large intermediate result `d1[d2]`, which SQL may not do. The remaining solutions may have this problem too.

3) dplyr This suggests the corresponding dplyr solution. We also use `between` from data.table:
4) merge/subset Using only base R:
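A sketch with `merge` plus `subset` (row names are carried over from the merged result):

```r
d1 <- data.frame(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.frame(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))

subset(merge(d1, d2, by = "x"), start <= pos & pos <= end)
#   x start end pos
# 1 a     1   3   2
# 2 a     1   3   3
# 5 c    19  22  20
# 7 e     7  25  10
```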
Added: Note that the data.table solution here is much faster than the one in the other answer.
data.table v1.9.8+
has a new feature - non-equi joins. With that, this operation becomes even more straightforward:
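A sketch of such a non-equi join (the exact call isn't shown above; the `x.`/`i.` prefixes recover both tables' original columns, since the join conditions reuse `start`/`end` for the matched `pos` values in the default output):

```r
library(data.table)  # v1.9.8+
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))

d1[d2, on = .(x, start <= pos, end >= pos), nomatch = 0L,
   .(x, pos = i.pos, start = x.start, end = x.end)]
#    x pos start end
# 1: a   2     1   3
# 2: a   3     1   3
# 3: c  20    19  22
# 4: e  10     7  25
```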