I have a huge input file (a representative sample of which is shown below as input
):
> input
CT1 CT2 CT3
1 chr1:200-400 chr1:250-450 chr1:400-800
2 chr1:800-970 chr2:200-500 chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it by following some rules (described below) so that I get an output
like:
> output
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 0
chr2:300-700 1 1 0
chr1:250-450 1 1 0
chr2:200-500 1 1 0
chr2:600-1000 0 1 1
chr1:400-800 0 0 1
chr1:700-870 0 1 1
chr2:700-1400 0 1 1
Rules:
Take every index (the first in this case is chr1:200-400
), see if it overlap significantly with values in other column. From significantly I mean atleast 50% overlap of the range. If yes, write 1
below that column in which it exists, if not write 0
.
Now I explain how I got the output table. From our input we take 1st index input[1,1] which is chr1:200-400
. As it exists in column 1 we will write 1 below it. Now we will check if this or an overlapping range exists in any other column. This value overlaps only with the first value (chr1:250-450
) of the second column (CT2
). Now we check if this overlap is significant or not. We know that range is 200 (chr1:200-400), the overlap with 2nd column value (which is chr1:250-450) is 150 (250-400). As this overlap of 150 is greater than the half (50% of original range = 100) of original range (200-400 = 200) OR overlapping range (250-450 = 200). We consider it an overlap and assign 1 below the column CT2 as well. As this range does not overlap with any value in CT3
, we write 0 below CT3
.
Similarly for row 9 of output. chr2:700-1400
doesn't exist in CT1
so write 0 below it. For CT2
it overlaps with chr2:600-1000
. The original range here is 700 (chr2:700-1400), half of it is 350. The overlap with chr2:700-1000
of CT2
is 300 (out of actual range chr2:600-1000). Now this overlap of 300 is not greater than the half of actual range 700 (chr2:700-1400 of CT3) but it is greater than the half of overlapping range 400 (chr2:600-1000 of CT2). Therefore, we consider it an overlap and write 1 below CT2. As this range actually exists in CT3, we write 1 below it as well.
Here are the dput of input and output:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
To do this requires many steps, and a number of concepts from the
data.table
package, most notably non-equi joins. I've commented the code throughout, and recommend running it step by step if you want more insight:This will generate (almost) the same output you've provided:
The only difference is that chr1:700-870 has a 0 in the CT2 column: this is because it actually doesn't overlap any of the genomic windows in CT2, the only other window on chromosome 1 was chr1:250-450.