Curiously behaving IF block in Perl run on Windows

2019-09-02 05:09发布

问题:

Background: I have a Perl script that I wrote to go through two files. The basic point of the script is to identify overlaps between one list of coordinates, defining the beginnings and ends of randomly selected chromosomal segments, and a second list of coordinates, defining the beginnings and endings of actual gene transcripts.

The first input file contains three columns. The first is for the chromosome number, and the second and third are the proximal and distal coordinates, in base pairs, of the randomly selected regions. For eg,

chr1    1100349    2035647
chr1    47837656   736474584
.       .          .
.       .          .
.       .          .

The second input file contains four columns: chromosome number, proximal coordinate, distal coordinate, and the name of the gene. For eg,

chr1    1588354    2283765    geneA
chr1    55943837   787653743    geneB

Here is a set of test files I used to start off with. First set.

chr1    1   10
chr1    5   10
chr1    5   15
chr1    14  15
chr1    100 101
chr1    11  17

Second set.

chr1    1   5   geneA
chr1    7   10  geneB
chr1    12  16  geneC
chr1    18  21  geneD
chr10   126602211   126609396   B4galnt1

The script reads off the first line from the first list, then reads through all the lines of the second list, and prints for me whether and how the first coordinate pair overlaps with the second coordinate pair (Is the first coordinate pair outside the second pair? Is the first pair inside or overlapping with the second?) Then, the script goes back and reads off the second line from the first list, and repeats the process. The first file has 200,000 lines. The second several thousand. It is running now overnight.

The problem: When the script determines the relationship between the first and second coordinate pairs, it prints out a line to an output file. Not all these print statements need to be sent to output, so I tried to comment them out. However, when I did this, none of the print statements sending information to the output file got printed. Statements are printed to the screen, though, just not to the output file. The script is running, but all the print to output statements are being used, so the output file is getting huge. If the script would just print to output for only those coordinates that overlap, the output file would be very, very much smaller. At present, the output file is now 2,131,294 KB! And that's only up to chromosome 11. There are eight more to go through, albeit smaller ones, but the file size is still going to expand greatly.

Updated information: This is edited in after my original posting. To be more precise, it is only when I comment out the first print $output "..."; statement that is inside the loop (the very first statement is to print a header, and this is before the loop) that the script fails to print anything, even when all the others are left alone (not commented).

In case it matters: I wrote the script on my Mac, using Fraise, but I am running it on a PC, the script contained in a Notepad text file.

Here's the script: Note: there are many print statements in the file, many commented out. The print statements of interest are those printing to the output file. Those are the ones that, when one or more are commented out, wind up never sending information to the output file. Those statements look like:

print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";

The actual script:

#!/bin/usr/perl
use strict; use warnings;

#############
## findGenes_after_ASboot_v5.pl
#############

#############
#  After making a big list of randomly placed intervals,
#  this script uses RefGene.txt file and identifies the 
#  the gene symbols encompassed or overlapped by each random interval 
#############

unless(scalar @ARGV == 2) {
    # $0 name of the program being executed;
    print "\n usage: $0 filename containig your list of positions and a RefGene-type file \n\n"; 
    exit;
}

#for ( my $i = 0; $i < 25; $i++ ){
#     print "#########################################\n";
#}

open( my $positions, "<", $ARGV[0] ) or die;
open( my $RefGene,   "<", $ARGV[1] ) or die;

open( my $output, ">>", "output.txt") or die;

# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";

my $pos_count = 1;
my $ref_count = 1;

for my $position_line (<$positions>) {
    #print "$position_line";
    my @posline = split('\t', $position_line);
    #print "$posline[0]\t$posline[1]\t$posline[2]";
    open( my $RefGene,   "<", $ARGV[1] ) or die;

    for my $ref (<$RefGene>){
        #print "\t$ref";    
        my @refline = split('\t', $ref);
        # print "\t$refline[0]\t$refline[1]\t$refline[2]\t$refline[3]";
        chomp $posline[2];
        chomp $refline[3];     
        if ( $posline[0] eq $refline[0] ){
            #print "\tchr match\n";

            # am i entirely prox to a gene?
            if ( $posline[2] < $refline[1] ){
                #print "too proximal\n";
                print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";

                #the following print statement is one I'd like to be able to comment out
                print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";
                $ref_count++; 
                next; 
            }

            # am i entirely distal to a gene?
            elsif ( $posline[1] > $refline[2] ){
                #print "too distal\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
                #the following print statement is one I'd like to be able to comment out
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
                $ref_count++; 
                next; 
            }

            # am i completely inside a gene?
            elsif ( $posline[1] >= $refline[1] &&
                $posline[2] <= $refline[2]    ){
                #print "inside\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
                $ref_count++; 
                next; 
            }

            # am i proximally overlapping?
            elsif ( $posline[1] < $refline[1] &&
                $posline[2] <= $refline[2]    ){
                #print "proximal overlap\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
                $ref_count++; 
                next; 
            }
            # am i distally overlapping?
            elsif ( $posline[1] >= $refline[1] &&
                $posline[2] > $refline[2]    ){
                #print "distal overlap\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
                $ref_count++; 
                next; 
            }

            else {
                #print "encompassing\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
                $ref_count++; 
                next;
            }       

        } # if a match with chr

        else {
            next;
        }

    } # for each reference
    $pos_count++;    
} # for each position

Data Files:

  • http://www.filedropper.com/proxdistalpositionsofrandompositions
  • http://www.filedropper.com/modifiedrefgene
  • Some output: http://www.filedropper.com/output_17

回答1:

I see two potential flaws in your code:

  1. Always use while when processing a file instead of for.

    Whenever you use the latter, you're actually loading the entire file into memory versus just doing line by line processing. If you're actually able to support doing that though, you should go ahead and load your smaller file entirely and just iterate on the lines.

  2. Split on "\t" not on '\t'.

    The latter is almost certainly a bug, unless you really do use a 2 character delimiter for your data.

Anyway, I've cleaned up your code considerably. Removing duplicated lines etc. It's likely that a lot of these changes may either not work (as it's untested) or not be what you want. However, if you go through the code, perhaps it will give you ideas at the very least:

#!/bin/usr/perl
use strict;
use warnings;
use autodie;

#############
## findGenes_after_ASboot_v5.pl
#############

#############
#  After making a big list of randomly placed intervals,
#  this script uses RefGene.txt file and identifies the 
#  the gene symbols encompassed or overlapped by each random interval 
#############

die "\n usage: $0 filename containig your list of positions and a RefGene-type file \n\n"
    if @ARGV != 2;

open my $positions, "<", $ARGV[0];

# Cache file by key
my %refgenes;
open my $RefGene,   "<", $ARGV[1];
while (<$RefGene>) {
    chomp;
    my @cols = split "\t";
    push @{$refgenes{$cols[0]}}, \@cols;
}

open my $output, ">>", "output.txt";

# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";

my $pos_count = 1;
my $ref_count = 1;

while (my $position_line = <$positions>) {
    chomp $position_line;
    my @posline = split "\t", $position_line;

    # Only iterate on matching refs
    for my $ref (@{ $refgenes{$posline[0]} }) {
        my @refline = @$ref;

        my $desc = join "\t", ($posline[0], $pos_count, @posline[1,2], $refline[0], $ref_count, @refline[1,2,3]);
        my $message = '';

        # am i entirely prox to a gene?
        if ( $posline[2] < $refline[1] ){
            $message = 'too proximal';

        # am i entirely distal to a gene?
        } elsif ( $posline[1] > $refline[2] ) {
            $message = 'too distal';

        # am i completely inside a gene?
        } elsif ( $posline[1] >= $refline[1] && $posline[2] <= $refline[2] ) {
            $message = 'inside';

        # am i proximally overlapping?
        } elsif ( $posline[1] < $refline[1] && $posline[2] <= $refline[2] ) {
            $message = 'proximal overlap';

        # am i distally overlapping?
        } elsif ( $posline[1] >= $refline[1] && $posline[2] > $refline[2] ) {
            $message = 'distal overlap';

        } else {
            $message = 'encompassing';
        }

        print "$desc\t$message\n";
        print $output "$desc\t$message\n";

        $ref_count++; 
    } # for each reference
    $pos_count++;
} # for each position