Using awk to print pairs of records having overlap

2019-09-09 20:57发布

问题:

I have different records corresponding to ranges with start($6) and stop($7). What I want to do is to print out all pairs of records having overlapping ranges.

For example, my data is as follow:

id1 0   376 . scaffold1 5165761 5166916 
id2 0   366 . scaffold1 2297244 2298403 
id3 155 456 . scaffold1 692777  693770 
id4 185 403 . scaffold1 102245  729675

What I want is a result like

id3 id4

because the range of id4 is overlapping with id3. I have been searching the solutions all over the internet but it seems there is nothing approaching to my problem.

I would really appreciate if some might give some advice.


After following the advice of some from the below replies, I did try this code which did work !

awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -

The processing time was quite short...it was done after not even 1 minute for a file with 1400 records.

回答1:

$ cat tst.awk
{
    beg[$1] = $6
    end[$1] = $7
    ids[++numIds] = $1
}
END {
    for (i=1; i<=numIds; i++) {
        idI = ids[i]
        for (j=1; j<=numIds; j++) {
            idJ = ids[j]
            if (idI != idJ) {
                if ( ( (beg[idI] >= beg[idJ]) && (beg[idI] <= end[idJ]) ) ||
                     ( (end[idI] >= beg[idJ]) && (end[idI] <= end[idJ]) ) ) {
                    if ( !seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++ ) {
                        print idI, idJ
                    }
                }
            }
        }
    }
}

$ awk -f tst.awk file
id3 id4

The input file you provided in your question doesn't cover many cases so given this input file with a lot more overlap variants in it:

$ cat file
id1 185 403 . scaffold1 10  20
id2 185 403 . scaffold1 11  19
id3 185 403 . scaffold1  9  10
id4 185 403 . scaffold1 20  21
id5 185 403 . scaffold1  9  11
id6 185 403 . scaffold1 19  21
id7 185 403 . scaffold1 10  20
id8 185 403 . scaffold1  1   8

Try the above:

$ awk -f tst.awk file
id1 id3
id1 id4
id1 id5
id1 id6
id1 id7
id2 id1
id2 id5
id2 id6
id2 id7
id3 id5
id3 id7
id4 id6
id4 id7
id5 id7
id6 id7

vs the scripts + pipe you provided at the end of your answer:

$ awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -
id3 id5
id4 id6
id4 id7
id4 id1
id5 id3
id6 id7
id6 id1
id6 id2
id7 id3
id7 id5
id7 id1
id1 id3
id1 id5
id1 id7
id2 id5
id2 id7
id2 id1

and notice that your scripts report the overlap between some (but not all) of the ids twice:

id1 id7
id7 id1
id3 id5
id5 id3

while my script only reports them once courtesy of !seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++.



回答2:

This solution requires GNU awk:

{
    start = $6 * 10 + 5;
    stop = $7 * 10;
    data[start] = data[start] " " $1;
    data[stop] = data[stop] " " $1;
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc";
    for (d in data) {
        count = split(data[d], fields);
        for (i in fields) {
            id = fields[i];
            if (d % 10 == 5) { # start
                for (s in started) {
                    print s, id;
                }
                started[id] = 1;
            } else { # stop
                delete started[id];
            }
        }
    }
}

The basic idea is this: put the start and stop markers (I called them indices, possibly a bad choice) in a single array and sort that array by its indices. Then, iterate through the array. If you encounter a "start" marker, put it in another array (called "started"). If you encounter a "stop" marker, remove it from that array. Now, if you encounter a "start" marker, that interval overlaps with all intervals currently in the array "started", so print out the matches. By making sure that the "stop" markers precede the "start" markers with the same original index, you can eliminate corner cases.