Identify overlapping ranges in AWK

Posted 2019-08-31 15:59

Question:

I have a file with rows of 3 tab-separated columns, e.g.:

2 45 100

And a second file with rows of 3 tab-separated columns, e.g.:

2 10 200

I want an awk command that matches lines when $1 is the same in both files and the range $2-$3 in file 1 intersects at all with the range $2-$3 in file 2. The range in file 1 can lie within the range in file 2, the range in file 2 can lie within the range in file 1, or there can be just a partial overlap. Any kind of intersection between the ranges counts as a match, and the matching row should then be printed to a third file.

My current code only matches when the values in $1, $2 and $3 are identical in both files, so it misses cases where one range lies within the other or the ranges partially overlap, because the exact numbers then differ.

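  awk '
        BEGIN {
            FS = OFS = "\t";    # input and output are tab separated
        }
        # First file: remember each exact (key, start, end) triple.
        FILENAME == ARGV[1] {
            pair[ $1, $2, $3 ] = 1;
            next;
        }
        # Second file: print only rows whose triple is identical to one in file1.
        {
            if ( pair[ $1, $2, $3 ] == 1 ) {
                print $1, $2, $3;
            }
        }
  ' file1 file2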

Example Input:

File1:

1 10 23
2 30 50
6 100 110
8 20 25

File2:

1 5 15
10 30 50
2 10 100
8 22 24

Here, line 1 of file1 matches line 1 of file2 because the first column matches and the ranges overlap (10-15 is common to both). Line 2 of file1 matches line 3 of file2 because the first column matches and the range 30-50 lies within 10-100. Line 4 of file1 matches line 4 of file2 because the first column matches and the range 22-24 is common to both. The output would therefore be lines 1, 3 and 4 of file2, printed to a new output file.

Hope these examples help.

Your help is really appreciated.

Thank you in advance!

Answer 1:

This is quite easy if you use the join command to merge both files on their first field ($1) and then let awk test the ranges.

If you only want the file2 lines as output:

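join <(sort -k1b,1 file1) <(sort -k1b,1 file2) | awk '{if ($2 >= $4 && $2 <= $5 || $3 >= $4 && $3 <= $5 || $4 >= $2 && $4 <= $3 || $5 >= $2 && $5 <= $3) {print $1" "$4" "$5;}}' -

(The inputs are sorted with sort -k1b,1 rather than sort -n because join expects both inputs to be ordered lexically on the join field. With numerically sorted input it can silently skip pairs, for example when one file has the keys 2 and 10 and the other has only 10.)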

Using your input files I got this output:

1 5 15
2 10 100
8 22 24
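
Since the question asks for awk specifically, here is a minimal awk-only sketch of the same idea, without join or sort. It assumes that file1 fits in memory, that every range is closed with start <= end, and that a file2 line should be printed once even if it overlaps several file1 ranges. The array names count, lo and hi and the temporary directory are my own choices for illustration. It relies on the fact that two closed ranges intersect exactly when each one starts no later than the other ends, which covers containment in either direction as well as partial overlap.

```shell
# Recreate the sample data from the question in a scratch directory (illustrative paths).
tmp=$(mktemp -d)
printf '1\t10\t23\n2\t30\t50\n6\t100\t110\n8\t20\t25\n' > "$tmp/file1"
printf '1\t5\t15\n10\t30\t50\n2\t10\t100\n8\t22\t24\n' > "$tmp/file2"

overlaps=$(awk -F'\t' '
    # First file: store every range, grouped by the key in $1.
    NR == FNR {
        n = ++count[$1]
        lo[$1, n] = $2
        hi[$1, n] = $3
        next
    }
    # Second file: print the line if it intersects any stored range for its key.
    {
        for (i = 1; i <= count[$1]; i++) {
            # "+ 0" forces a numeric comparison instead of a string comparison.
            if ($2 + 0 <= hi[$1, i] + 0 && lo[$1, i] + 0 <= $3 + 0) {
                print
                break    # one match is enough, avoid printing the line twice
            }
        }
    }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$overlaps"
rm -r "$tmp"
```

With the sample input this prints the same three file2 lines as above, in file2 order and still tab separated. Redirect the awk output to a file of your choice if you want it stored as a third file.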