I have a file characterizing genomic regions that looks like this:
chrom chromStart chromEnd PGB
chr1 12874 28371 2
chr1 15765 21765 1
chr1 15795 28371 2
chr1 18759 24759 1
chr1 28370 34961 1
chr3 233278 240325 1
chr3 239279 440831 2
chr3 356365 362365 1
PGB is the category of a genomic region, which is defined by its chromosome (chrom) and its start (chromStart) and end (chromEnd) coordinates.
I want to collapse overlapping regions so that any stretch covered by both PGB = 1 and PGB = 2 is assigned a new category, PGB = 3. The desired output is:
chrom chromStart chromEnd PGB
chr1 12874 15764 2
chr1 15765 24759 3
chr1 24760 28369 2
chr1 28370 28371 3
chr1 28372 34961 1
chr3 233278 239278 1
chr3 239279 240325 3
chr3 240326 356364 2
chr3 356365 362365 3
chr3 362366 440831 2
In other words, I want an output file that reports unique, non-overlapping regions. There are two criteria.
First, if PGB (column 4) is identical between overlapping rows, merge the ranges, e.g.:
chrom chromStart chromEnd PGB
chr1 1 10 1
chr1 5 15 1
output
chrom chromStart chromEnd PGB
chr1 1 15 1
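The first criterion amounts to the usual sort-and-merge pass over intervals. Here is a minimal Python sketch, assuming inclusive integer coordinates, input already restricted to one chromosome and one PGB value, and that directly adjacent ranges (e.g. 1-10 and 11-20) should be merged as well. The helper name `merge_ranges` is hypothetical.

```python
def merge_ranges(ranges):
    """Merge closed integer ranges [(start, end), ...] that overlap or touch."""
    merged = []
    for start, end in sorted(ranges):
        # Assumption: a range starting right after the previous end
        # (start == previous end + 1) counts as contiguous and is merged.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


print(merge_ranges([(1, 10), (5, 15)]))  # [(1, 15)]
```

If adjacent ranges should stay separate, dropping the `+ 1` in the comparison restricts merging to ranges that share at least one position.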
Second, if PGB differs between rows, chrom (column 1) is identical, and the ranges (columns 2 and 3) overlap, report the overlapping range as PGB = 3, along with the ranges unique to each category, e.g.:
chrom chromStart chromEnd PGB
chr1 30 100 1
chr1 50 150 2
output
chrom chromStart chromEnd PGB
chr1 30 49 1
chr1 50 100 3
chr1 101 150 2
I hope that illustrates the problem better.
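Both criteria can be handled in one pass with a sweep over range boundaries. The following is a minimal Python sketch of that idea, not a definitive implementation. It assumes inclusive integer coordinates, PGB values limited to 1 and 2, and that contiguous stretches with the same resulting category are merged. The function name `collapse` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def collapse(rows):
    """rows: iterable of (chrom, start, end, pgb), inclusive coords, pgb in {1, 2}.

    Returns non-overlapping (chrom, start, end, pgb) tuples, where pgb == 3
    marks positions covered by both categories.
    """
    # events[chrom][pos] = [change in PGB=1 coverage, change in PGB=2 coverage]
    events = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for chrom, start, end, pgb in rows:
        events[chrom][start][pgb - 1] += 1    # range opens at start
        events[chrom][end + 1][pgb - 1] -= 1  # and closes just after end

    out = []
    for chrom, ev in events.items():  # chromosomes in order of first appearance
        depth = [0, 0]  # number of PGB=1 / PGB=2 ranges covering the position
        points = sorted(ev)
        for pos, nxt in zip(points, points[1:]):
            depth[0] += ev[pos][0]
            depth[1] += ev[pos][1]
            # 1 = only PGB 1, 2 = only PGB 2, 3 = both, 0 = uncovered gap
            label = (1 if depth[0] else 0) + (2 if depth[1] else 0)
            if label == 0:
                continue
            last = out[-1] if out else None
            if last and last[0] == chrom and last[3] == label and last[2] == pos - 1:
                last[2] = nxt - 1  # extend the contiguous segment of the same category
            else:
                out.append([chrom, pos, nxt - 1, label])
    return [tuple(r) for r in out]


rows = [
    ("chr1", 12874, 28371, 2),
    ("chr1", 15765, 21765, 1),
    ("chr1", 15795, 28371, 2),
    ("chr1", 18759, 24759, 1),
    ("chr1", 28370, 34961, 1),
]
for region in collapse(rows):
    print(*region)
```

Run on the chr1 rows from the example input, this prints the five chr1 regions listed in the desired output. Reading and writing the whitespace-delimited file is left out to keep the sketch short.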
I've created a script that I believe accomplishes this goal.
And the result:
Note: I created this mostly as an exercise for myself, but if you use it in any way, please let me know.