there is a data frame with which I am working it looks like this
the two columns denote start and end of a chunk. I need to know how many of these chunks are present at every position from 0 to 23110906. Sometimes the chunks overlap and sometimes there might be a region which has no chunk covering at all. It is like segments in R. but I dont need a visualisation I just need a way to find quickly the number of chunks at every postion. Is there an easy way?
The data structure you are looking for is called interval tree, which is a type of sorted binary tree that contains (guess what) intervals, each of which usually has start and end positions.
I never used an interval tree to store points as you need, but I guess you can define your intervals as
interval.start = interval.end
.Building the tree will take linear time and querying the intervals of your data frame will take logarithmic time, which is much faster than pteetor's quadratic time approach.
The R package IRanges from Bioconductor may help you. I would try the function
findOverlaps()
and thentable()
the results. I invite you to read the documentation to see whether it fits your specific needs.I took that matrix and examined the overlaps, of which there were only five intervals with any overlaps and none with 2, assuming they were ordered by their starting postions:
So which ones were they?
So it seemed rather wasteful of machine resources and time to create a vector that was 23 million items long and it would be a lot easier to simply build a function that would count the number of intervals in which any particular position was within:
These are the intervals where there are 2:
Here's some data
An IRanges notion is
coverage()
Which is a compact run-length encoding; query at the ith location
Do math on the Rle
or coerce to an integer vector
This
will also be reasonably fast.
Note subtle differences between these, from differences in the notion of whether the ends are included (IRanges: yes; tabulate: no) in the range. If these are actually genome coordinates then GenomicRanges is the place to go, to account for seqname (chromosome) and strand.
If you really want the count at every position -- all 23,110,906 positions -- this code will tell you.
But it's very slow. Faster code would require some clever optimization to eliminate the (very) redundant counting down by these two lines.
If you simply want the count at one position,
i
, just callcountChunks(i)
.