I have records that each describe a range, with the start in column 6 ($6) and the stop in column 7 ($7). I want to print all pairs of records whose ranges overlap.
For example, my data is as follows:
id1 0 376 . scaffold1 5165761 5166916
id2 0 366 . scaffold1 2297244 2298403
id3 155 456 . scaffold1 692777 693770
id4 185 403 . scaffold1 102245 729675
What I want is a result like
id3 id4
because the range of id4 overlaps with that of id3. I have searched all over the internet, but nothing seems to address my problem.
I would really appreciate it if someone could give me some advice.
After following the advice in some of the replies below, I tried this code, which worked!
awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -
The processing time was quite short: it finished in under a minute for a file with 1400 records.
The input file you provided in your question doesn't cover many cases, so given this input file with a lot more overlap variants in it:
Try the above:
vs the scripts + pipe you provided at the end of your answer:
and notice that your scripts report the overlap between some (but not all) of the ids twice:
while my script only reports them once courtesy of
!seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++
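To illustrate how that expression works on its own, here is a minimal sketch. It assumes a list of "id id" pairs on standard input; the function name dedup_pairs and the sample pairs are made up for illustration and are not taken from the script above. The key always puts the smaller id first, so "id4 id3" and "id3 id4" map to the same array element and only the first occurrence is printed.

```shell
# Hypothetical helper: read "idI idJ" pairs and print each unordered pair once.
dedup_pairs() {
    awk '{
        idI = $1; idJ = $2
        # Canonical key: smaller id first, so both orderings of a pair collide.
        key = (idI < idJ ? idI FS idJ : idJ FS idI)
        # seen[key]++ is 0 (false) only the first time the key appears.
        if (!seen[key]++) print key
    }'
}

# Made-up input: the first overlap is reported in both orders.
printf '%s\n' 'id3 id4' 'id4 id3' 'id5 id6' | dedup_pairs
```

Note that this sketch prints the canonical key itself, so the pair always comes out with the smaller id first.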
This solution requires GNU awk:
The basic idea is this: put the start and stop markers (I called them indices, possibly a bad choice) in a single array and sort that array by its indices. Then, iterate through the array. If you encounter a "start" marker, put it in another array (called "started"). If you encounter a "stop" marker, remove it from that array. Now, if you encounter a "start" marker, that interval overlaps with all intervals currently in the array "started", so print out the matches. By making sure that the "stop" markers precede the "start" markers with the same original index, you can eliminate corner cases.
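The idea described above can be sketched roughly as follows. This is not the GNU awk script itself but a portable variant under several assumptions: it uses sort(1) between two plain awk passes instead of GNU awk's array sorting, the file name ranges.txt and the function name overlaps are made up for illustration, all records are assumed to lie on the same scaffold ($5 is ignored), and every record is assumed to have start < stop.

```shell
# Hypothetical sample file holding the data from the question.
cat > ranges.txt <<'EOF'
id1 0 376 . scaffold1 5165761 5166916
id2 0 366 . scaffold1 2297244 2298403
id3 155 456 . scaffold1 692777 693770
id4 185 403 . scaffold1 102245 729675
EOF

overlaps() {
    # One "position kind id" event per endpoint: kind 1 = start, 0 = stop.
    awk '{ print $6, 1, $1; print $7, 0, $1 }' "$1" |
    # Sort by position; at equal positions stops (0) come before starts (1).
    sort -k1,1n -k2,2n |
    awk '
        $2 == 1 {
            # A new start overlaps every interval that is still open.
            for (id in started) print $3, id
            started[$3] = 1
        }
        $2 == 0 { delete started[$3] }   # the interval is closed
    '
}

overlaps ranges.txt
```

Each pair is printed exactly once, at the moment the later-starting interval begins, so no separate de-duplication step is needed. Because stops sort before starts at the same position, two ranges that merely touch (one stops where the other starts) are not reported as overlapping.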