Consider the following data.tables. The first defines a set of regions with start and end positions for each group:
library(data.table)
d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
setkey(d1, x,start)
# x start end
# 1: a 1 3
# 2: b 5 11
# 3: c 19 22
# 4: d 30 39
# 5: e 7 25
The second represents observations for each group
d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d2, x,pos)
# x pos
# 1: a 2
# 2: a 3
# 3: b 3
# 4: b 12
# 5: c 20
# 6: d 52
# 7: e 10
Ultimately I'd like to be able to extract the rows in d2 that fall within a region for the matching x value in d1. The desired result is:
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
The start/end positions for any group x
will never overlap but there may be gaps of values not in any region.
Now, I believe I should be using a rolling join. From what I can tell, I cannot use the "end" column in the join.
I've tried
d1[d2, roll=T, nomatch=0, mult="all"][start<=end]
and got
# x start end
# 1: a 2 3
# 2: a 3 3
# 3: c 20 22
# 4: e 10 25
which is the right set of rows I want. However, "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the rolling join so I could report "start", "pos", and "end" as desired?
Overlap joins were implemented with commit 1375 in data.table v1.9.3, and are available in the current stable release, v1.9.4. The function is called `foverlaps`. From NEWS:

Let's consider x, an interval defined as `[a, b]`, where `a <= b`, and y, another interval defined as `[c, d]`, where `c <= d`. The interval y is said to overlap x at all iff `d >= a` and `c <= b`. And y is entirely contained within x iff `a <= c` and `d <= b`. For the different types of overlaps implemented, please have a look at `?foverlaps`.
Your question is a special case of an overlap join: in `d1` you have true physical intervals with `start` and `end` positions. In `d2`, on the other hand, there are only positions (`pos`), not intervals. To be able to do an overlap join, we need to create intervals in `d2` as well. This is achieved by adding an additional variable `pos2`, which is identical to `pos` (`d2[, pos2 := pos]`). Thus, we now have an interval in `d2`, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in `d2` can then be used in `foverlaps` to do an overlap join with `d1`.

`by.y` by default is `key(y)`, so we skipped it. `by.x` by default takes `key(x)` if it exists, and if not takes `key(y)`. But a key doesn't exist for `d2`, and we can't take the columns from `y`, because they don't have the same names. So, we set `by.x` explicitly. The type of overlap is within, and we'd like to have all matches, but only if there is a match.
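Putting this together on the question's data, the call described above might look like the following sketch (the `by.x` vector and arguments follow the explanation; `d1` is keyed so that `key(y)` ends with its interval columns):

```r
library(data.table)  # v1.9.4+
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))

setkey(d1, x, start, end)   # by.y defaults to key(y); its last two columns must be the interval
d2[, pos2 := pos]           # the virtual, zero-width interval [pos, pos2]

foverlaps(d2, d1, by.x = c("x", "pos", "pos2"),
          type = "within", mult = "all", nomatch = 0L)
#    x start end pos pos2
# 1: a     1   3   2    2
# 2: a     1   3   3    3
# 3: c    19  22  20   20
# 4: e     7  25  10   10
```

Dropping `pos2` afterwards (`ans[, pos2 := NULL]`) leaves exactly the "start", "pos", "end" columns asked for.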
NB: `foverlaps` uses data.table's binary search feature (along with `roll` where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc.) are inspired by the function `findOverlaps()` from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for genomics).

So what's the advantage?
A benchmark on the code above on your data shows `foverlaps()` slower than Gabor's answer (timings: Gabor's data.table solution = 0.004 vs. foverlaps = 0.021 seconds). But does it really matter at this granularity?

What would be really interesting is to see how well it scales, in terms of both speed and memory. In Gabor's answer, we join based on the key column `x` and then filter the results. What if `d1` has about 40K rows and `d2` has 100K rows (or more)? For each row in `d2` that matches an `x` in `d1`, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly.

Generate data:
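The generation code itself isn't shown above; here is a sketch under the stated sizes (~40K regions, 100K positions), where the seed, group count, and value ranges are illustrative assumptions:

```r
library(data.table)
set.seed(1L)            # seed, group count, and value ranges are assumptions
n1 <- 40e3L             # regions in d1
n2 <- 100e3L            # positions in d2
grp <- sprintf("g%03d", 1:500)

s  <- sample(1e6L, n1, replace = TRUE)
d1 <- data.table(x = sample(grp, n1, TRUE),
                 start = s,
                 end   = s + sample(100L, n1, TRUE))  # short regions, as in the Q
d2 <- data.table(x = sample(grp, n2, TRUE),
                 pos = sample(1e6L, n2, TRUE))
```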
foverlaps:
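A sketch of the timed call (`ans1` is the object the memory figure below refers to; the assumed data generation is repeated so the snippet runs on its own):

```r
library(data.table)
set.seed(1L)   # same assumed data as in the generation step
s  <- sample(1e6L, 40e3L, TRUE)
d1 <- data.table(x = sample(sprintf("g%03d", 1:500), 40e3L, TRUE),
                 start = s, end = s + sample(100L, 40e3L, TRUE))
d2 <- data.table(x = sample(sprintf("g%03d", 1:500), 100e3L, TRUE),
                 pos = sample(1e6L, 100e3L, TRUE))

setkey(d1, x, start, end)
d2[, pos2 := pos]
system.time(
  ans1 <- foverlaps(d2, d1, by.x = c("x", "pos", "pos2"),
                    type = "within", mult = "all", nomatch = 0L)
)
```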
This took ~1GB of memory in total, out of which `ans1` is 420MB. Most of the time spent here is on the subset, really. You can check this by setting the argument `verbose=TRUE`.

Gabor's solutions:
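A sketch of the join-then-filter approach on the same assumed data — the large intermediate `d1[d2]` is what drives the memory usage:

```r
library(data.table)
set.seed(1L)   # same assumed data as above, repeated so this runs alone
s  <- sample(1e6L, 40e3L, TRUE)
d1 <- data.table(x = sample(sprintf("g%03d", 1:500), 40e3L, TRUE),
                 start = s, end = s + sample(100L, 40e3L, TRUE))
d2 <- data.table(x = sample(sprintf("g%03d", 1:500), 100e3L, TRUE),
                 pos = sample(1e6L, 100e3L, TRUE))

setkey(d1, x)
setkey(d2, x)
system.time(
  # every d1/d2 row pair sharing an x is materialised, then filtered
  ans2 <- d1[d2, allow.cartesian = TRUE][between(pos, start, end)]
)
```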
And this took a total of ~3.5GB.
I just noted that Gabor already mentions the memory required for intermediate results. So, trying out `sqldf`: that took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.
[The answers were verified to be identical after removing `pos2` from `ans1` and setting the key on both answers.]

Note that this overlap join is designed for problems where `d2` doesn't necessarily have identical start and end coordinates (e.g. genomics, the field where I come from, where `d2` is usually about 30-150 million rows or more). `foverlaps()` is stable, but is still under development, meaning some arguments and names might get changed.

NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood and is quite memory efficient as well. In my benchmarks on genomics data, `foverlaps()` is faster. But that's for another (blog) post, some other time.
is faster. But that's for another (blog) post, some other time.1) sqldf This is not data.table but complex join criteria are easy to specify in a straight forward manner in SQL:
giving:
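The SQL itself isn't shown above; a sketch of one way to phrase it (the `order by` is added here only to make the output order deterministic, and `end` is bracket-quoted because it is an SQL keyword):

```r
library(sqldf)

d1 <- data.frame(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.frame(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))

sqldf("select d2.x, d2.pos, d1.start, d1.[end]
         from d2
         join d1 on d2.x = d1.x
              and d2.pos between d1.start and d1.[end]
        order by d2.x, d2.pos")
#   x pos start end
# 1 a   2     1   3
# 2 a   3     1   3
# 3 c  20    19  22
# 4 e  10     7  25
```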
2) data.table For a data.table answer try this:
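A sketch of the join-then-filter idea (keyed on `x`, with data.table's `between` doing the range check):

```r
library(data.table)
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d1, x)
setkey(d2, x)

d1[d2][between(pos, start, end)]   # join on x, then range-filter
#    x start end pos
# 1: a     1   3   2
# 2: a     1   3   3
# 3: c    19  22  20
# 4: e     7  25  10
```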
Note that this does have the disadvantage of forming the possibly large intermediate result `d1[d2]`, which SQL may not do. The remaining solutions may have this problem too.

3) dplyr This suggests the corresponding dplyr solution. We also use `between` from data.table:
4) merge/subset Using only base R:
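A sketch with `merge` plus `subset` (row names are carried over from the merged result):

```r
d1 <- data.frame(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.frame(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))

subset(merge(d1, d2, by = "x"), start <= pos & pos <= end)
#   x start end pos
# 1 a     1   3   2
# 2 a     1   3   3
# 5 c    19  22  20
# 7 e     7  25  10
```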
Added: Note that the data.table solution here is much faster than the one in the other answer.
data.table v1.9.8+
has a new feature - non-equi joins. With that, this operation becomes even more straightforward:
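A sketch of such a non-equi join (the exact call isn't shown above; the `x.`/`i.` prefixes recover both tables' original columns, since the join conditions reuse `start`/`end` for the matched `pos` values in the default output):

```r
library(data.table)  # v1.9.8+
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30,7), end = c(3,11,22,39,25))
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))

d1[d2, on = .(x, start <= pos, end >= pos), nomatch = 0L,
   .(x, pos = i.pos, start = x.start, end = x.end)]
#    x pos start end
# 1: a   2     1   3
# 2: a   3     1   3
# 3: c  20    19  22
# 4: e  10     7  25
```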