overlapping segments R

I am working with a data frame that looks like this:

The two columns denote the start and end of a chunk. I need to know how many of these chunks are present at every position from 0 to 23110906. Sometimes the chunks overlap, and sometimes there is a region that no chunk covers at all. It is like segments() in R, but I don't need a visualisation; I just need a quick way to find the number of chunks at every position. Is there an easy way?

Tags: r dataframe segments

4 answers

做个烂人

Reply #2 · 2019-08-06 05:10

The data structure you are looking for is called an interval tree, which is a type of sorted binary tree that contains (guess what) intervals, each of which usually has a start and an end position.

I have never used an interval tree to query single points, as you need, but I guess you can represent each point as an interval with interval.start = interval.end.

Building the tree takes O(n log n) time and each query takes logarithmic time (plus the number of overlaps it reports), which is much faster than pteetor's approach, which scans every interval for every position.

The R package IRanges from Bioconductor may help you. I would try the function findOverlaps() and then table() the results. I invite you to read the documentation to see whether it fits your specific needs.
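As a lighter-weight alternative to a full interval tree, the same per-position query can be sketched in base R with two sorted endpoint vectors and findInterval(), which does a binary search. This is not an interval tree and not the IRanges API. The names starts, ends and count_at are hypothetical, the data are made up, and positions are assumed to be integers with inclusive ends.

```r
# Hypothetical example data: two inclusive intervals, [10, 25] and [20, 30].
starts <- c(10, 20)
ends   <- c(25, 30)

# Sort each endpoint vector once, up front.
sorted_starts <- sort(starts)
sorted_ends   <- sort(ends)

# Chunks covering pos = (chunks that started at or before pos)
#                     - (chunks that ended strictly before pos).
# findInterval(x, v) returns how many elements of the sorted vector v are <= x.
count_at <- function(pos) {
  findInterval(pos, sorted_starts) - findInterval(pos - 1, sorted_ends)
}

count_at(c(9, 22, 25, 26, 31))
```

Each query costs two binary searches, so it stays cheap even with many intervals, and count_at() is vectorised over pos.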


啃猪蹄的小仙女

Reply #3 · 2019-08-06 05:13

I took that matrix and examined the overlaps. Assuming the intervals are ordered by their starting positions, only five of them overlap the next interval, and none overlaps the interval two places ahead:

> sum( mat[1:28,2] > mat[2:29,1] )
[1] 5
> sum( mat[1:27,2] > mat[3:29,1] )
[1] 0

So which ones were they?

> which( mat[1:28,2] > mat[2:29,1] )
[1] 19 21 23 25 28

So it seemed rather wasteful of machine resources and time to create a vector that was 23 million items long. It would be a lot easier to simply build a function that counts the number of intervals containing any particular position:

 fchunk <- function(pos) {sum( mat[ , 1] <= pos & mat[,2] >= pos)}
#--------
> fchunk(16675330)
[1] 2
> fchunk(16675329)
[1] 1

These are the regions that are covered by two intervals:

sapply( which( mat[1:28,2] > mat[2:29,1] ) , 
       function(int1) c( mat[int1+1, 1], mat[int1, 2] ) )
#--------
       [,1]     [,2]     [,3]     [,4]     [,5]
n7 16675330 18097680 20233612 21288777 22847516
n8 16724700 18445265 20741145 22780817 22967567


Viruses.

Reply #4 · 2019-08-06 05:17

Here's some data

m = matrix(c(10, 20, 25, 30), 2)

The relevant IRanges notion is coverage()

> library(IRanges)
> cvg = coverage(IRanges(start=m[,1], end=m[,2]))
> cvg
integer-Rle of length 30 with 4 runs
  Lengths:  9 10  6  5
  Values :  0  1  2  1

This is a compact run-length encoding. Query it at the ith location

> cvg[22]
integer-Rle of length 1 with 1 run
  Lengths: 1
  Values : 2
> runValue(cvg[22])
[1] 2

Do math on the Rle

> cvg > 1
logical-Rle of length 30 with 3 runs
  Lengths:    19     6     5
  Values : FALSE  TRUE FALSE

or coerce to an integer vector

> as(cvg, "integer")
 [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1

This

> cumsum(tabulate(m[,1], 30)) - cumsum(tabulate(m[,2], 30))
 [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 0

will also be reasonably fast.

Note the subtle difference between the two results, which comes from whether the end position is counted as part of the range (IRanges: yes; tabulate: no). If these are actually genome coordinates, then GenomicRanges is the place to go, since it accounts for seqname (chromosome) and strand.
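If the ends should be counted in the base R version as well, one possible adjustment (a sketch, not part of the answer above) is to shift the end positions by one before tabulating. The names n and cov_incl are hypothetical, and positions are assumed to be positive integers; tabulate() does not count position 0, so data starting at 0 would need an offset.

```r
m <- matrix(c(10, 20, 25, 30), 2)   # same example data as above
n <- 30                              # last position of interest

# A chunk stops contributing one position after its end, so shift the ends by 1.
# tabulate() ignores values larger than n, so an end at n needs no special case.
cov_incl <- cumsum(tabulate(m[, 1], n)) - cumsum(tabulate(m[, 2] + 1, n))

rle(cov_incl)   # run lengths 9 10 6 5 with values 0 1 2 1
```

With this shift the result has the same run lengths and values as the coverage() output shown earlier.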


虎瘦雄心在

Reply #5 · 2019-08-06 05:25

If you really want the count at every position -- all 23,110,906 positions -- this code will tell you.

countChunks = function(i) sum(dfrm$n7 <= i & i <= dfrm$n8)
counts = sapply(1:23110906, countChunks)

But it's very slow. Faster code would require some clever optimization to eliminate the (very) redundant counting done by these two lines.

If you simply want the count at one position, i, just call countChunks(i).
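One way to avoid the redundant counting, offered here only as a sketch, is to sweep over the sorted interval boundaries: the count can only change at a start or just past an end, so it is enough to record it once per boundary. The function coverage_runs and its example data are hypothetical, and positions are assumed to be integers with inclusive ends.

```r
# Return piecewise-constant coverage as rows (from, to, count).
coverage_runs <- function(starts, ends) {
  pos   <- c(starts, ends + 1)   # coverage changes at a start and just past an end
  delta <- c(rep(1L, length(starts)), rep(-1L, length(ends)))

  ord   <- order(pos)
  pos   <- pos[ord]
  run   <- cumsum(delta[ord])    # running count after each boundary event

  # Several events can share a position; keep the count after the last of them.
  keep  <- !duplicated(pos, fromLast = TRUE)
  from  <- pos[keep]
  count <- run[keep]

  k <- length(from)
  data.frame(from = from[-k], to = from[-1] - 1, count = count[-k])
}

# Hypothetical example: inclusive intervals [10, 25] and [20, 30].
runs <- coverage_runs(c(10, 20), c(25, 30))
runs
```

The work is proportional to the number of chunks (plus a sort) rather than to the 23 million positions, and uncovered gaps between chunks show up as rows with count 0.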
