extract overlapping regions

I have a file characterizing genomic regions that looks like this:

chrom   chromStart  chromEnd    PGB
chr1    12874   28371   2
chr1    15765   21765   1
chr1    15795   28371   2
chr1    18759   24759   1
chr1    28370   34961   1
chr3    233278  240325  1
chr3    239279  440831  2
chr3    356365  362365  1

Basically PGB describes the category of the genomic region characterised by its chromosome number (chrom), start (chromStart) and end (chromEnd) coordinates.

I wish to collapse the overlapping regions such that overlapping regions of PGB = 1 and 2 are in a new category, PGB = 3. Output being:

chrom   chromStart  chromEnd    PGB
chr1    12874   15764   2
chr1    15765   24759   3
chr1    24760   28369   2
chr1    28370   28371   3
chr1    28372   34961   1
chr3    233278  239278  1
chr3    239279  240325  3
chr3    240326  356364  2
chr3    356365  440831  3

Basically I wish to obtain an output file which reports unique regions. There are a two criteria.

First, if PGB (column 4) is identical between rows, merge range. eg.

chrom   chromStart  chromEnd    PGB
chr1    1   10  1
chr1    5   15  1

output

chrom   chromStart  chromEnd    PGB
chr1    1   15  1

Second, if PGB is different between rows, chr (column 1) is identical, and the ranges overlap (col2 and 3), report overlapping range as PGB = 3 as well as the ranges unique to their individual categories.

eg.

chrom   chromStart  chromEnd    PGB
chr1    30  100 1
chr1    50  150 2

output

chrom   chromStart  chromEnd    PGB
chr1    30  49  1
chr1    50  100 3
chr1    101 150 2

I hope that illustrates the problem better.

标签： perl awk bioinformatics bioperl

1条回答

Viruses.

2楼-- · 2019-06-13 06:16

I've created a script that I believe accomplishes this goal.

use List::Util qw(max);

use strict;
use warnings;

# Skip Header row
<DATA>;

# With only 60k rows, going to just slurp all the data
my %chroms;
while (<DATA>) {
    chomp;
    my ($chrom, $start, $end, $pgb) = split ' ';

    # Basic Data Validation.
    warn "chromStart is NaN, '$start'" if $start =~ /\D/;
    warn "chromEnd is NaN, '$end'" if $end =~ /\D/;
    warn "range not ordered, '$start' to '$end'" if $start > $end;
    warn "PGB only can be 1 or 2, '$pgb'" if ($pgb ne '1' && $pgb ne '2');

    push @{$chroms{$chrom}{$pgb}}, [$start, $end];
}

print "chrom    chromStart  chromEnd    PGB\n";

# Process each Chrom
for my $chrom (sort keys %chroms) {
    for my $pgb (sort keys %{$chroms{$chrom}}) {
        my $ranges = $chroms{$chrom}{$pgb};

        # Sort Ranges
        @$ranges = sort {$a->[0] <=> $b->[0] || $a->[1] <=> $b->[1]} @$ranges;

        # Combine overlapping and continguous ranges.
        # - Note because we're dealing with integer ranges, 1-4 & 5-8 are contiguous
        for my $i (0..$#$ranges-1) {
            if ($ranges->[$i][1] >= $ranges->[$i+1][0] - 1) {
                $ranges->[$i+1][0] = $ranges->[$i][0];
                $ranges->[$i+1][1] = max($ranges->[$i][1], $ranges->[$i+1][1]);
                $ranges->[$i] = undef;
            }
        }
        @$ranges = grep {$_} @$ranges;
    }

    # Create pgb=3 for overlaps.
    # - Save old ranges into aliases, and then start fresh
    my $pgb1array = $chroms{$chrom}{1};
    my $pgb2array = $chroms{$chrom}{2};
    my @ranges;

    # Always working on the first range in each array, until one of the arrays is empty
    while (@$pgb1array && @$pgb2array) {
        # Aliases to first element
        my $pgb1 = $pgb1array->[0];
        my $pgb2 = $pgb2array->[0];

        # PGB1 < PGB2
        if ($pgb1->[1] < $pgb2->[0]) {
            push @ranges, [@{shift @$pgb1array}, 1]

        # PGB2 < PGB1
        } elsif ($pgb2->[1] < $pgb1->[0]) {
            push @ranges, [@{shift @$pgb2array}, 2]

        # There's overlap for all rest 
        } else {
            # PGB1start < PGB2start
            if ($pgb1->[0] < $pgb2->[0]) {
                push @ranges, [$pgb1->[0], $pgb2->[0]-1, 1];
                $pgb1->[0] = $pgb2->[0];

            # PGB2start < PGB1start
            } elsif ($pgb2->[0] < $pgb1->[0]) {
                push @ranges, [$pgb2->[0], $pgb1->[0]-1, 2];
                $pgb2->[0] = $pgb1->[0];
            }
            # (Starts are equal now)

            # PGB1end < PGB2end
            if ($pgb1->[1] < $pgb2->[1]) {
                $pgb2->[0] = $pgb1->[1] + 1;
                push @ranges, [@{shift @$pgb1array}, 3];

            # PGB2end < PGB1end
            } elsif ($pgb2->[1] < $pgb1->[1]) {
                $pgb1->[0] = $pgb2->[1] + 1;
                push @ranges, [@{shift @$pgb2array}, 3];

            # PGB2end = PGB1end
            } else {
                push @ranges, [@$pgb1, 3];
                shift @$pgb1array;
                shift @$pgb2array;
            }
        }
    }

    # Append whichever is left over
    push @ranges, map {$_->[2] = 1; $_} @$pgb1array;
    push @ranges, map {$_->[2] = 2; $_} @$pgb2array;

    printf "%-8s %-11s %-11s %s\n", $chrom, @$_ for (@ranges);
}

1;

__DATA__
chrom   chromStart  chromEnd    PGB
chr1    12871   12873   2
chr1    12874   28371   2
chr1    15765   21765   1
chr1    15795   28371   2
chr1    18759   24759   1
chr1    28370   34961   1
chr3    233278  240325  1
chr3    239279  440831  2
chr3    356365  362365  1

And the result:

chrom    chromStart  chromEnd    PGB
chr1     12871       15764       2
chr1     15765       24759       3
chr1     24760       28369       2
chr1     28370       28371       3
chr1     28372       34961       1
chr3     233278      239278      1
chr3     239279      240325      3
chr3     240326      356364      2
chr3     356365      362365      3
chr3     362366      440831      2

Note: I created this mostly as an exercise for myself, but if you use it in any way, please let me know.

0人赞添加讨论(0) 举报

extract overlapping regions

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间