可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have two data.frames each with three columns: chrom, start & stop, let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row - by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop
.
Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal, it just ends up making things more readable given this particular solution.
rangesA:
chrom start stop
5 100 105
1 200 250
9 275 300
rangesB:
chrom start stop
1 200 265
5 99 106
9 275 290
for each row in rangesA:
matches <- which((rangesB[,'chrom'] == rangesA[row,'chrom']) &&
(rangesB[,'start'] <= rangesA[row, 'start']) &&
(rangesB[,'stop'] >= rangesA[row, 'stop']))
I figure there's got to be a better (and by better, I mean faster over large instances of rangesA and rangesB) way to do this than looping over this construct. Any ideas?
回答1:
This would be a lot easier / faster if you can merge the two objects first.
ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
# chrom startA stopA startB stopB
#1 1 200 250 200 265
#2 5 100 105 99 106
回答2:
Use the IRanges/GenomicRanges packages from Bioconductor, which is made for dealing with these exact problems (and scales massively)
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
There are a few appropriate containers for ranges on different chromosomes, one is RangesList
library(IRanges)
rangesA <- split(IRanges(rangesA$start, rangesA$stop), rangesA$chrom)
rangesB <- split(IRanges(rangesB$start, rangesB$stop), rangesB$chrom)
#which rangesB wholly contain at least one rangesA?
ov <- countOverlaps(rangesB, rangesA, type="within")>0
回答3:
The data.table
package has a function foverlaps()
which is capable of merging over interval ranges since v1.9.4:
require(data.table)
setDT(rangesA)
setDT(rangesB)
setkey(rangesB)
foverlaps(rangesA, rangesB, type="within", nomatch=0L)
# chrom start stop i.start i.stop
# 1: 5 99 106 100 105
# 2: 1 200 265 200 250
setDT()
converts data.frame to data.table by reference
setkey()
sorts the data.table by the columns provided (in this case all columns, since we did not provide any), and marks those columns as sorted, which we'll use later to perform the join on.
foverlaps()
does the overlapping join efficiently. See this answer for a detailed explanation and comparison to other approaches.
回答4:
For your example data:
rangesA <- data.frame(
chrom = c(5, 1, 9),
start = c(100, 200, 275),
stop = c(105, 250, 300)
)
rangesB <- data.frame(
chrom = c(1, 5, 9),
start = c(200, 99, 275),
stop = c(265, 106, 290)
)
This will do it with sapply
, such that each column is one row in rangesA and each row is corresponding row in rangesB:
> sapply(rangesA$stop, '>=', rangesB$start) & sapply(rangesA$start, '<=', rangesB$stop)
[,1] [,2] [,3]
[1,] FALSE TRUE FALSE
[2,] TRUE FALSE FALSE
[3,] FALSE FALSE TRUE
回答5:
RangesA and RangesB are clearly BED syntax, this can be done outside R in the command line with BEDtools, extremely fast and flexible with a dozen other options to work with genomic intervals. Then put the results back again into R.
https://code.google.com/p/bedtools/
回答6:
I add the dplyr
solution.
library(dplyr)
inner_join(rangesA, rangesB, by="chrom") %>%
filter(start.y < start.x | stop.y > stop.x)
Output:
chrom start.x stop.x start.y stop.y
1 5 100 105 99 106
2 1 200 250 200 265