Background: I have a Perl script that I wrote to go through two files. The basic point of the script is to identify overlaps between one list of coordinates, defining the beginnings and ends of randomly selected chromosomal segments, and a second list of coordinates, defining the beginnings and endings of actual gene transcripts.
The first input file contains three columns. The first is for the chromosome number, and the second and third are the proximal and distal coordinates, in base pairs, of the randomly selected regions. For eg,
chr1 1100349 2035647
chr1 47837656 736474584
. . .
. . .
. . .
The second input file contains four columns: chromosome number, proximal coordinate, distal coordinate, and the name of the gene. For eg,
chr1 1588354 2283765 geneA
chr1 55943837 787653743 geneB
Here is a set of test files I used to start off with. First set.
chr1 1 10
chr1 5 10
chr1 5 15
chr1 14 15
chr1 100 101
chr1 11 17
Second set.
chr1 1 5 geneA
chr1 7 10 geneB
chr1 12 16 geneC
chr1 18 21 geneD
chr10 126602211 126609396 B4galnt1
The script reads off the first line from the first list, then reads through all the lines of the second list, and prints for me whether and how the first coordinate pair overlaps with the second coordinate pair (Is the first coordinate pair outside the second pair? Is the first pair inside or overlapping with the second?) Then, the script goes back and reads off the second line from the first list, and repeats the process. The first file has 200,000 lines. The second several thousand. It is running now overnight.
The problem: When the script determines the relationship between the first and second coordinate pairs, it prints out a line to an output file. Not all these print statements need to be sent to output, so I tried to comment them out. However, when I did this, none of the print statements sending information to the output file got printed. Statements are printed to the screen, though, just not to the output file. The script is running, but all the print to output statements are being used, so the output file is getting huge. If the script would just print to output for only those coordinates that overlap, the output file would be very, very much smaller. At present, the output file is now 2,131,294 KB! And that's only up to chromosome 11. There are eight more to go through, albeit smaller ones, but the file size is still going to expand greatly.
Updated information: This is edited in after my original posting. To be more precise, it is only when I comment out the first print $output "...";
statement that is inside the loop (the very first statement is to print a header, and this is before the loop) that the script fails to print anything, even when all the others are left alone (not commented).
In case it matters: I wrote the script on my Mac, using Fraise, but I am running it on a PC, the script contained in a Notepad text file.
Here's the script: Note: there are many print statements in the file, many commented out. The print statements of interest are those printing to the output file. Those are the ones that, when one or more are commented out, wind up never sending information to the output file. Those statements look like:
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
The actual script:
#!/bin/usr/perl
use strict; use warnings;
#############
## findGenes_after_ASboot_v5.pl
#############
#############
# After making a big list of randomly placed intervals,
# this script uses RefGene.txt file and identifies the
# the gene symbols encompassed or overlapped by each random interval
#############
unless(scalar @ARGV == 2) {
# $0 name of the program being executed;
print "\n usage: $0 filename containig your list of positions and a RefGene-type file \n\n";
exit;
}
#for ( my $i = 0; $i < 25; $i++ ){
# print "#########################################\n";
#}
open( my $positions, "<", $ARGV[0] ) or die;
open( my $RefGene, "<", $ARGV[1] ) or die;
open( my $output, ">>", "output.txt") or die;
# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";
my $pos_count = 1;
my $ref_count = 1;
for my $position_line (<$positions>) {
#print "$position_line";
my @posline = split('\t', $position_line);
#print "$posline[0]\t$posline[1]\t$posline[2]";
open( my $RefGene, "<", $ARGV[1] ) or die;
for my $ref (<$RefGene>){
#print "\t$ref";
my @refline = split('\t', $ref);
# print "\t$refline[0]\t$refline[1]\t$refline[2]\t$refline[3]";
chomp $posline[2];
chomp $refline[3];
if ( $posline[0] eq $refline[0] ){
#print "\tchr match\n";
# am i entirely prox to a gene?
if ( $posline[2] < $refline[1] ){
#print "too proximal\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";
#the following print statement is one I'd like to be able to comment out
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";
$ref_count++;
next;
}
# am i entirely distal to a gene?
elsif ( $posline[1] > $refline[2] ){
#print "too distal\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
#the following print statement is one I'd like to be able to comment out
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
$ref_count++;
next;
}
# am i completely inside a gene?
elsif ( $posline[1] >= $refline[1] &&
$posline[2] <= $refline[2] ){
#print "inside\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
$ref_count++;
next;
}
# am i proximally overlapping?
elsif ( $posline[1] < $refline[1] &&
$posline[2] <= $refline[2] ){
#print "proximal overlap\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
$ref_count++;
next;
}
# am i distally overlapping?
elsif ( $posline[1] >= $refline[1] &&
$posline[2] > $refline[2] ){
#print "distal overlap\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
$ref_count++;
next;
}
else {
#print "encompassing\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
$ref_count++;
next;
}
} # if a match with chr
else {
next;
}
} # for each reference
$pos_count++;
} # for each position
Data Files:
- http://www.filedropper.com/proxdistalpositionsofrandompositions
- http://www.filedropper.com/modifiedrefgene
- Some output: http://www.filedropper.com/output_17