I have different records corresponding to ranges with start($6) and stop($7).
What I want to do is to print out all pairs of records having overlapping ranges.
For example, my data is as follow:
id1 0 376 . scaffold1 5165761 5166916
id2 0 366 . scaffold1 2297244 2298403
id3 155 456 . scaffold1 692777 693770
id4 185 403 . scaffold1 102245 729675
What I want is a result like
id3 id4
because the range of id4 is overlapping with id3.
I have been searching the solutions all over the internet but it seems there is nothing approaching to my problem.
I would really appreciate if some might give some advice.
After following the advice of some from the below replies, I did try this code which did work !
awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -
The processing time was quite short...it was done after not even 1 minute for a file with 1400 records.
$ cat tst.awk
{
beg[$1] = $6
end[$1] = $7
ids[++numIds] = $1
}
END {
for (i=1; i<=numIds; i++) {
idI = ids[i]
for (j=1; j<=numIds; j++) {
idJ = ids[j]
if (idI != idJ) {
if ( ( (beg[idI] >= beg[idJ]) && (beg[idI] <= end[idJ]) ) ||
( (end[idI] >= beg[idJ]) && (end[idI] <= end[idJ]) ) ) {
if ( !seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++ ) {
print idI, idJ
}
}
}
}
}
}
$ awk -f tst.awk file
id3 id4
The input file you provided in your question doesn't cover many cases so given this input file with a lot more overlap variants in it:
$ cat file
id1 185 403 . scaffold1 10 20
id2 185 403 . scaffold1 11 19
id3 185 403 . scaffold1 9 10
id4 185 403 . scaffold1 20 21
id5 185 403 . scaffold1 9 11
id6 185 403 . scaffold1 19 21
id7 185 403 . scaffold1 10 20
id8 185 403 . scaffold1 1 8
Try the above:
$ awk -f tst.awk file
id1 id3
id1 id4
id1 id5
id1 id6
id1 id7
id2 id1
id2 id5
id2 id6
id2 id7
id3 id5
id3 id7
id4 id6
id4 id7
id5 id7
id6 id7
vs the scripts + pipe you provided at the end of your answer:
$ awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -
id3 id5
id4 id6
id4 id7
id4 id1
id5 id3
id6 id7
id6 id1
id6 id2
id7 id3
id7 id5
id7 id1
id1 id3
id1 id5
id1 id7
id2 id5
id2 id7
id2 id1
and notice that your scripts report the overlap between some (but not all) of the ids twice:
id1 id7
id7 id1
id3 id5
id5 id3
while my script only reports them once courtesy of !seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++
.
This solution requires GNU awk
:
{
start = $6 * 10 + 5;
stop = $7 * 10;
data[start] = data[start] " " $1;
data[stop] = data[stop] " " $1;
}
END {
PROCINFO["sorted_in"] = "@ind_num_asc";
for (d in data) {
count = split(data[d], fields);
for (i in fields) {
id = fields[i];
if (d % 10 == 5) { # start
for (s in started) {
print s, id;
}
started[id] = 1;
} else { # stop
delete started[id];
}
}
}
}
The basic idea is this: put the start and stop markers (I called them indices, possibly a bad choice) in a single array and sort that array by its indices. Then, iterate through the array. If you encounter a "start" marker, put it in another array (called "started"). If you encounter a "stop" marker, remove it from that array. Now, if you encounter a "start" marker, that interval overlaps with all intervals currently in the array "started", so print out the matches. By making sure that the "stop" markers precede the "start" markers with the same original index, you can eliminate corner cases.