there is a data frame with which I am working it looks like this
the two columns denote start and end of a chunk. I need to know how many of these chunks are present at every position from 0 to 23110906. Sometimes the chunks overlap and sometimes there might be a region which has no chunk covering at all. It is like segments in R. but I dont need a visualisation I just need a way to find quickly the number of chunks at every postion. Is there an easy way?
Here's some data
m = matrix(c(10, 20, 25, 30), 2)
An IRanges notion is coverage()
> cvg = coverage(IRanges(start=m[,1], end=m[,2]))
> cvg
integer-Rle of length 30 with 4 runs
Lengths: 9 10 6 5
Values : 0 1 2 1
Which is a compact run-length encoding; query at the ith location
> cvg[22]
integer-Rle of length 1 with 1 run
Lengths: 1
Values : 2
> runValue(cvg[22])
[1] 2
Do math on the Rle
> cvg > 1
logical-Rle of length 30 with 3 runs
Lengths: 19 6 5
Values : FALSE TRUE FALSE
or coerce to an integer vector
> as(cvg, "integer")
[1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1
This
> cumsum(tabulate(m[,1], 30)) - cumsum(tabulate(m[,2], 30))
[1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 0
will also be reasonably fast.
Note subtle differences between these, from differences in the notion of whether the ends are included (IRanges: yes; tabulate: no) in the range. If these are actually genome coordinates then GenomicRanges is the place to go, to account for seqname (chromosome) and strand.
The data structure you are looking for is called interval tree, which is a type of sorted binary tree that contains (guess what) intervals, each of which usually has start and end positions.
I never used an interval tree to store points as you need, but I guess you can define your intervals as interval.start = interval.end
.
Building the tree will take linear time and querying the intervals of your data frame will take logarithmic time, which is much faster than pteetor's quadratic time approach.
The R package IRanges from Bioconductor may help you. I would try the function findOverlaps()
and then table()
the results. I invite you to read the documentation to see whether it fits your specific needs.
I took that matrix and examined the overlaps, of which there were only five intervals with any overlaps and none with 2, assuming they were ordered by their starting postions:
> sum( mat[1:28,2] > mat[2:29,1] )
[1] 5
> sum( mat[1:27,2] > mat[3:29,1] )
[1] 0
So which ones were they?
> which( mat[1:28,2] > mat[2:29,1] )
[1] 19 21 23 25 28
So it seemed rather wasteful of machine resources and time to create a vector that was 23 million items long and it would be a lot easier to simply build a function that would count the number of intervals in which any particular position was within:
fchunk <- function(pos) {sum( mat[ , 1] <= pos & mat[,2] >= pos)}
#--------
> fchunk(16675330)
[1] 2
> fchunk(16675329)
[1] 1
These are the intervals where there are 2:
sapply( which( mat[1:28,2] > mat[2:29,1] ) ,
function(int1) c( mat[int1+1, 1], mat[int1, 2] ) )
#--------
[,1] [,2] [,3] [,4] [,5]
n7 16675330 18097680 20233612 21288777 22847516
n8 16724700 18445265 20741145 22780817 22967567
If you really want the count at every position -- all 23,110,906 positions -- this code will tell you.
countChunks = function(i) sum(dfrm$n7 <= i & i <= dfrm$n8)
counts = sapply(1:23110906, countChunks)
But it's very slow. Faster code would require some clever optimization to eliminate the (very) redundant counting down by these two lines.
If you simply want the count at one position, i
, just call countChunks(i)
.