Reading .fasta sequences to extract nucleotide dat

2019-02-19 15:36发布

问题:

Before I continue, I thought I'd refer readers to my previous problems with Perl, being a beginner to all of this.

These were my posts over the past few days, in chronological order:

  1. How do I average column values from a tab-separated data... (Solved)
  2. Why do I see no computed results in my output file? (Solved)
  3. Using a .fasta file to compute relative content of sequences

Now as I've stated above, thanks to help from a few of you, I've managed to figure out the first two queries and I've really learnt from it. I'm truly grateful. For a person who knows nothing about this, and still feels like he doesn't, the help was practically a Godsend.

The last query remains unsolved and this is a continuation. I did have a look at some of the recommended texts, but as I'm trying to get this finished before Monday, I'm unsure if I've overlooked anything completely. Either way, I have had a go at attempting the task.

Just so you know, the task is to open and read a .fasta file (I think I've finally nailed something pretty well, hallelujah!), read each sequence, compute the relative G+C nucleotide content, and then write to a TABDelimited file and the names of the genes and their respective G+C content.

Even though I've had a go at attempting this, I know that I am no where near ready to execute the program to provide the results that I'm after, which is why I'm reaching out to you guys again for some guidance, or examples of how to go about this. As with my previous, solved queries, I'd like it to be in a similar style to what I've already done them in - even though it might not be the most convenient/efficient way. It just allows me to know what I'm doing each step of the way, even though it seems like I'm spamming it up!

Anyway, the .fasta file reads something like:

>label
sequence
>label
sequence
>label
sequence

I'm unsure how to open the .fasta file, so I'm not sure what labels apply to which, but I know that the genes should be labelled either gag, pol, or env. Do I need to open the .fasta file to know what I'm doing, or can I do it 'blindly' by going with the above format?

It may be perfectly obvious, but I'm still struggling with all of this. I'm feeling like I should have caught on by now!

Anyway, the current code I have is as follows:

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    foreach my $line ($infile) {
        if($line = ~/^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
            next;
        } elsif($line = ~/^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
            next; 
        } elsif($line = ~/^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
            next;
        } else {
            $sequence = $line;
        }
    }
    {
        $sequence =~ s/\s//g;               # Whitespace characters are removed
        return $sequence;
    }

I'm not sure if anything's correct here, but executing it left me with a syntax error ar line 35 (beyond the last line, and hence there isn't anything there!). It said at 'EOF'. That's about all I can point out. Otherwise I'm trying to figure out how to compute the quantities of the nucleotides G + C in each of the sequences, and then tabulating this properly in an output .txt file. I believe that's what is meant by a TABDelimited file?

In any case, I apologise if this query seems to be too lengthy, 'dumb' or a repeat, but in saying that, I couldn't find any information directly pertaining to this, so your help would be much appreciated, and the explanations for each step too if possible!!

Kindest.

回答1:

You have an extra brace right near the end. This should work:

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.

use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time

while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    if($line =~ /^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
        next;

    } elsif($line =~ /^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
        next; 
    } elsif($line =~ /^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
        next;
    } else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;               # Whitespace characters are removed
    print OUTFILE $sequence;
}

Also I edited your return line. Return will exit your loop. I suspect what you want is to print it to a file, so I have done that. You may need to do some further transformation first to get it into a tab separated format.