I have a huge input file (a representative sample of which is shown below as input
):
> input
CT1 CT2 CT3
1 chr1:200-400 chr1:250-450 chr1:400-800
2 chr1:800-970 chr2:200-500 chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it by following some rules (described below) so that I get an output
like:
> output
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 0
chr2:300-700 1 1 0
chr1:250-450 1 1 0
chr2:200-500 1 1 0
chr2:600-1000 0 1 1
chr1:400-800 0 0 1
chr1:700-870 0 1 1
chr2:700-1400 0 1 1
Rules:
Take every index (the first in this case is chr1:200-400
), see if it overlap significantly with values in other column. From significantly I mean atleast 50% overlap of the range. If yes, write 1
below that column in which it exists, if not write 0
.
Now I explain how I got the output table. From our input we take 1st index input[1,1] which is chr1:200-400
. As it exists in column 1 we will write 1 below it. Now we will check if this or an overlapping range exists in any other column. This value overlaps only with the first value (chr1:250-450
) of the second column (CT2
). Now we check if this overlap is significant or not. We know that range is 200 (chr1:200-400), the overlap with 2nd column value (which is chr1:250-450) is 150 (250-400). As this overlap of 150 is greater than the half (50% of original range = 100) of original range (200-400 = 200) OR overlapping range (250-450 = 200). We consider it an overlap and assign 1 below the column CT2 as well. As this range does not overlap with any value in CT3
, we write 0 below CT3
.
Similarly for row 9 of output. chr2:700-1400
doesn't exist in CT1
so write 0 below it. For CT2
it overlaps with chr2:600-1000
. The original range here is 700 (chr2:700-1400), half of it is 350. The overlap with chr2:700-1000
of CT2
is 300 (out of actual range chr2:600-1000). Now this overlap of 300 is not greater than the half of actual range 700 (chr2:700-1400 of CT3) but it is greater than the half of overlapping range 400 (chr2:600-1000 of CT2). Therefore, we consider it an overlap and write 1 below CT2. As this range actually exists in CT3, we write 1 below it as well.
Here are the dput of input and output:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
To do this requires many steps, and a number of concepts from the data.table
package, most notably non-equi joins. I've commented the code throughout, and recommend running it step by step if you want more insight:
library(data.table)
input <- structure(list(CT1 = structure(1:3, .Label =
c("chr1:200-400", "chr1:800-970", "chr2:300-700"), class =
"factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 =
structure(1:3, .Label = c("chr1:400-800", "chr1:700-870",
"chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
output <- structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L),
CT2 = c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L,
0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class =
"data.frame", row.names = c("chr1:200-400", "chr1:800-970",
"chr2:300-700", "chr1:250-450", "chr2:200-500", "chr2:600-1000",
"chr1:400-800", "chr1:700-870", "chr2:700-1400"))
# Builds a data.table by breaking a string like "chr1:300-700" into
# three columns: chr, start, and end.
split_genomic_range <- function(str) {
chr <- gsub(":.*", "", str)
start <- gsub("-.*", "", gsub(".*:", "", str))
end <- gsub(".*-", "", str)
start <- as.numeric(start)
end <- as.numeric(end)
return(data.table(chr=chr, start=start, end=end))
}
# First break the input data.table into three new tables - we will need
# to perform non-equi joins of the index table (column CT1 in input) to
# the tables built from the other two columns.
ct1 <- split_genomic_range(input$CT1)
ct2 <- split_genomic_range(input$CT2)
ct3 <- split_genomic_range(input$CT3)
# Create an index table with all genomic ranges, then check for
# overlaps in each of the three tables created from the input
# columns:
index_table <- unique(rbind(ct1, ct2, ct3))
# Returns entries from the index_table if they overlap > 50% any
# entries in the lookup table or vice-versa
get_overlapping_ranges <- function(index_table, lookup_table) {
# This function does two non-equi joins. First, it checks whether
# any entries in the index_table have a 50% overlap with any
# entries in the lookup table. Second, it does the reverse, and
# checks whether any entries in the lookup_table have a 50% overlap
# with any entries in the index_table. This is required due to
# differing window sizes:
# e.g. 0-20 significantly overlaps 10-100, but 10-100 does not
# significantly overlap 0-20.
# We will need to create a "middle" column for each genomic range.
# We will need to create copies of each table to do this, otherwise
# they will end up with this new column as a side effect of the
# function call.
index_copy <- copy(index_table)
lookup_copy <- copy(lookup_table)
index_copy[, middle := start + (end - start) * 0.5]
lookup_copy[, middle := start + (end - start) * 0.5]
# In the index_copy we will also need to create dummy columns for
# the check. We need to do this so we can find the appropriate
# genomic window from the index table when we do the second
# non-equi join, otherwise the start and end columns will be
# clobbered.
index_copy[, start_index := start]
index_copy[, end_index := end]
# If the middle of a genomic range in the index table falls within
# a genomic range in the lookup table, then that genomic entry from
# the index table has a significant overlap with the corresponding
# in the lookup table
index_overlaps <- index_copy[lookup_table,
on=.(chr, middle >= start, middle <= end),
nomatch=0]
# Test the reverse: does any entry in the lookup table
# significantly overlap with any of the genomic windows in the
# index table?
lookup_overlaps <- index_copy[lookup_copy,
on=.(chr, start_index <= middle, end_index >= middle),
nomatch=0]
# Remove extra columns created by the non-equi join:
index_overlaps <- index_overlaps[,.(chr, start, end)]
lookup_overlaps <- lookup_overlaps[,.(chr, start, end)]
# Combine results and remove any duplicates that arise, e.g.
# because a genomic window in the index_table significantly
# overlaps with multiple genomic windows in the lookup table, or
# because the overlap is significant in both comparisons (i.e.
# where both windows are the same size).
overlaps <- unique(rbind(index_overlaps, lookup_overlaps))
return(overlaps)
}
ranges_in_ct1 <- get_overlapping_ranges(index_table, ct1)
ranges_in_ct2 <- get_overlapping_ranges(index_table, ct2)
ranges_in_ct3 <- get_overlapping_ranges(index_table, ct3)
# Combine results, indicating which column each genomic range was
# found to overlap:
overlaps <- rbind(
CT1=ranges_in_ct1, CT2=ranges_in_ct2, CT3=ranges_in_ct3,
idcol="input_column"
)
# Recombine the chr, start, and end columns to the original format:
overlaps[, genomic_window := paste0(chr, ":", start, "-", end)]
overlaps[, c("chr", "start", "end") := NULL]
# Convert to the wide format, so that each input column either has a
# 1 or 0 if the genomic window overlaps with 50% any other found in
# that column
overlaps <- dcast(overlaps, genomic_window ~ input_column,
fun.aggregate = length)
# Convert back to a data.frame:
overlaps <- as.data.frame(overlaps)
rownames(overlaps) <- overlaps$genomic_window
overlaps <- overlaps[,-1]
# Reorder so we can double check against the desired output:
overlaps <- overlaps[rownames(output),]
print(overlaps)
This will generate (almost) the same output you've provided:
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 0
chr2:300-700 1 1 0
chr1:250-450 1 1 0
chr2:200-500 1 1 0
chr2:600-1000 0 1 1
chr1:400-800 0 0 1
chr1:700-870 0 0 1
chr2:700-1400 0 1 1
The only difference is that chr1:700-870 has a 0 in the CT2 column: this is because it actually doesn't overlap any of the genomic windows in CT2, the only other window on chromosome 1 was chr1:250-450.