fileA contains intervals (start, end) and a value assigned to each interval (value).
start end value
0 123 1 #value 1 at positions 0 to 122 included.
123 78000 0 #value 0 at positions 123 to 77999 included.
78000 78004 56 #value 56 at positions 78000, 78001, 78002 and 78003.
78004 78005 12 #value 12 at position 78004.
78005 78006 1 #value 1 at position 78005.
78006 78008 21 #value 21 at positions 78006 and 78007.
78008 78056 8 #value 8 at positions 78008 to 78055 included.
78056 81000 0 #value 0 at positions 78056 to 80999 included.
fileB contains a list of the intervals I am interested in. I would like to retrieve the overlapping intervals from fileA. The starts and ends do not necessarily match. Here is an example of fileB:
start end label
77998 78005 romeo
78007 78012 juliet
The goal is to (1) retrieve the intervals from fileA that overlap with fileB and (2) append the corresponding labels from fileB. The expected result is (# marks the lines that were discarded; it is only there to help visualize and will not be in the final output):
start end value label
#
123 78000 0 romeo
78000 78004 56 romeo
78004 78005 12 romeo
#
78006 78008 21 juliet
78008 78056 8 juliet
#
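Stated as a condition, and assuming the intervals are half-open as in the comments above (start included, end excluded), two intervals overlap when each one starts before the other ends. A small sketch of that test, where `overlaps` is just a hypothetical helper name:

```r
# TRUE when the half-open intervals [a_start, a_end) and [b_start, b_end)
# share at least one position. `overlaps` is a hypothetical helper name.
overlaps <- function(a_start, a_end, b_start, b_end) {
  a_start < b_end & a_end > b_start
}

# kept: positions 77998 and 77999 belong to both intervals
overlaps(123, 78000, 77998, 78005)
# discarded: romeo covers positions up to 78004 only
overlaps(78005, 78006, 77998, 78005)
```

The strict inequalities are what exclude the 78005 78006 line from the romeo group; with <= and >= an interval that merely touches the boundary would be counted as overlapping.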
Here is my attempt at writing code:
#read from tab-delimited text files which do not contain column names
A<-read.table("fileA.txt",sep="\t",colClasses=c("numeric","numeric","numeric"))
B<-read.table("fileB.txt",sep="\t",colClasses=c("numeric","numeric","character"))
#add column names
colnames(A)<-c("start","end","value")
colnames(B)<-c("start","end","label")
#output intervals in `fileA` that overlap with an interval in `fileB`
A_overlaps<-A[((A$start <= B$start & A$end >= B$start)
|(A$start >= B$start & A$start <= B$end)
|(A$end >= B$start & A$end <= B$end)),]
At this point I am already getting unexpected results:
> A_overlaps
start end value
#missing
3 78000 78004 56
5 78005 78006 1 #this line should not be here
6 78006 78008 21
#missing
I didn't write the part to output the labels yet because I might as well fix this first, but I can't figure out what I am getting wrong...
[EDIT]
I also tried the following, but it just outputs the entirety of fileA:
A_overlaps <- A[(min(A$start,A$end) < max(B$start,B$end)
& max(A$start,A$end) > min(B$start,B$end)),]
This produces the desired output, but may be a little difficult to read.
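A minimal sketch of one way to get there, assuming half-open intervals [start, end) as described in the question's comments. The data frames are rebuilt inline rather than read from fileA.txt and fileB.txt, and this is not necessarily the shortest or fastest approach:

```r
# Example data from the question, built inline so the sketch is self-contained.
A <- data.frame(
  start = c(0, 123, 78000, 78004, 78005, 78006, 78008, 78056),
  end   = c(123, 78000, 78004, 78005, 78006, 78008, 78056, 81000),
  value = c(1, 0, 56, 12, 1, 21, 8, 0)
)
B <- data.frame(
  start = c(77998, 78007),
  end   = c(78005, 78012),
  label = c("romeo", "juliet"),
  stringsAsFactors = FALSE
)

# Handle one interval of B at a time, so that every row of A is compared
# against that single interval instead of against a recycled vector.
hits <- lapply(seq_len(nrow(B)), function(i) {
  # half-open overlap test: each interval starts before the other ends
  keep <- A$start < B$end[i] & A$end > B$start[i]
  out <- A[keep, ]
  out$label <- rep(B$label[i], nrow(out))  # attach the label of B's row i
  out
})

# Stack the per-label pieces and drop the row names inherited from A.
A_overlaps <- do.call(rbind, hits)
rownames(A_overlaps) <- NULL
A_overlaps
```

On the example data this returns the five labelled lines listed as the expected result. An interval of A that overlaps several intervals of B appears once per label.

The two attempts in the question fail for related reasons. In the first, A has 8 rows and B has 2, so `A$start <= B$start` recycles B and each row of A is checked against only one row of B (odd rows against romeo, even rows against juliet). In the second, `min()` and `max()` collapse whole columns to single numbers, so the condition is a single TRUE and every row of A is selected.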