Simplifying elements of a list/array and then addi

2019-09-05 07:38发布

问题:

I'm processing headers of a .fasta file (which is a file universally used in genetics/bioinformatics to store DNA/RNA sequence data). Fasta files have headers starting with a > symbol (which gives specific info), followed by the actual sequence data on the next line that the header describes. The sequence data extends indefinitely until the next \n after which is followed the next header and its respective sequence. For example:

>scaffold1.1_size947603
ACGCTCGATCGTACCAGACTCAGCATGCATGACTGCATGCATGCATGCATCATCTGACTGATG....
>scaffold2.1_size747567.2.603063_605944
AGCTCTGATCGTCGAAATGCGCGCTCGCTAGCTCGATCGATCGATCGATCGACTCAGACCTCA....

and so on...

So, I have a problem with the fasta headers of the genome for the organism with which I am working with. Unfortunately the perl expertise needed to solve this problem seems to be beyond my current skill level :S So I was hoping someone on here could show me how it can be done.

My genome consists of about 25000 fasta headers and their respective sequences, the headers in their current state are giving me a lot of trouble with sequence aligners I am trying to use, so I have to simplify them significantly. Here is an example of my first few headers:

>scaffold1.1_size947603
>scaffold10.1_size550551
>scaffold100.1_size305125:1-38034
>scaffold100.1_size305125:38147-38987
>scaffold100.1_size305125:38995-44965
>scaffold100.1_size305125:76102-78738
>scaffold100.1_size305125:84171-87568
>scaffold100.1_size305125:87574-89457
>scaffold100.1_size305125:90495-305068
>scaffold1000.1_size94939

Essentially I would like to refine these to look like this:

>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a

Or perhaps even this (but this seems like it would be more complicated):

>scaffold1.1
>scaffold10.1
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1

What I'm doing here is getting rid of all the size data for each scaffold of the genome. For scaffolds that happen to be fragmented, I'd like to denote them with a,b,c,d etc. There are a few scaffolds with more than 26 fragments so perhaps I could denote them with x, y, z, A, B, C, D .... etc..

I was thinking to do this with a simple replace foreach loop similar to this:

#!/usr/bin/perl -w

### Open the files 
$gen = './Hc_genome/haemonchus_V1.fa';
open(FASTAFILE, $gen);
@lines = <FASTAFILE>;
#print @lines; 

###Add an @ symbol to the start of the label
my @refined;
foreach my $lines (@lines){ 
    chomp $lines;
    $lines =~ s/match everything after .1/replace it with a, b, c.. etc/g;
    push @refined, $lines;
}
#print @refined;


###Push the array on to a new fasta file
open FILE3, "> ./Hc_genome/modded_haemonchus_V1.fa" or die "Cannot open output.txt: $!";

foreach (@refined)
{
    print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;  

But I don't know have to build in the added alphabetical label additions between the $1 and the \n in the match and replace operator. Essentially because I'm not sure how to do it sequentially through the alphabet for each fragment of a particular scaffold (All I could manage is to add an a to the start of each one...)

Please if you don't mind, let me know how I might achieve this!

Much appreciated!

Andrew

回答1:

In Perl, the increment operator ++ has “magical” behaviour with respect to strings. E.g. my $s = "a"; $a++ increments $a to "b". This goes on until "z", where the increment will produce "aa" and so forth.

The headers of your file appear to be properly sorted, so we can just loop through each header. From the header, we extract the starting part (everything up to including the .1). If this starting part is the same as the starting part of the previous header, we increment our sequence identifier. Otherwise, we set it to "a":

use strict; use warnings;  # start every script with these

my $index = "a";
my $prev = "";

# iterate over all lines (rather than reading all 25E3 into memory at once)
while (<>) {

  # pass through non-header lines
  unless (/^>/) {
    print;  # comment this line to remove non-header lines
    next;
  }

  s/\.1\K.*//s;  # remove everything after ".1". Implies chomping

  # reset or increment $index
  if ($_ eq $prev) {
    $index++;
  } else {
    $index = "a";
  }

  # update the previous line
  $prev = $_;

  # output new header
  print "$_$index\n";
}

Usage: $ perl script.pl <./Hc_genome/haemonchus_V1.fa >./Hc_genome/modded_haemonchus_V1.fa.

It is considered good style to write programs that accept input from STDIN and write to STDOUT, as this improves flexibility. Rather than hardcoding paths in your perl script, keep your script general, and use shell redirection operators like < to specify the input. This also saves you the hassle of manually opening the files.

Example Output:

>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a