Find multiple matches of this and that nucleotide

Posted 2019-08-10 00:57

Question:

I want to find every occurrence of ATG...TAG or ATG...TAA. I have tried the following:

#!/usr/bin/perl
use warnings;
use strict; 

my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');

while($file =~ /(?=(ATG\w+?TAG|ATG\w+?TAA))/g){ 
    print "$1\n";           
} 

which gives:

ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAAATGAAAAATAG
ATGAAAAATAG

I want:

ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAA
ATGAAAAATAG

What am I doing wrong?

Answer 1:

You are actually very close. Your pattern uses two alternatives (ATG...TAG and ATG...TAA) where I think you really only want a single one that accepts either stop codon; I also don't think you need the lookahead.

#!/usr/bin/perl
use warnings;
use strict;

my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');

while($file =~ /(ATG\w+?TA[AG])/g){
    print "$1\n";
}

# output
# ATGCCCCCCCCCCCCCTAG
# ATGAAAAAAAAAATAA
# ATGAAAAATAG

Piece by piece:

ATG matches a literal ATG

\w+? matches one or more word characters, lazily (as few as possible)

TA[AG] matches a literal TAA or TAG
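
The effect of the lazy quantifier can be seen by running the same pattern with and without the trailing ?. This is only a minimal sketch; the variable names $seq, @lazy and @greedy are chosen here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same test sequence as in the question.
my $seq = 'ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC';

# Lazy: \w+? takes as few characters as possible before TA[AG],
# so each match stops at the first stop codon after its ATG.
my @lazy = $seq =~ /(ATG\w+?TA[AG])/g;

# Greedy: \w+ first takes as many characters as possible, then
# backtracks only as far as the last TA[AG] in the string.
my @greedy = $seq =~ /(ATG\w+TA[AG])/g;

print "lazy:   $_\n" for @lazy;    # three separate matches
print "greedy: $_\n" for @greedy;  # one long match spanning all three
```

With the greedy form, the single match runs from the first ATG to the last TAG, which is why the ? after \w+ matters here.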



Answer 2:

/(ATG\w+?TA[AG])/ works and is a bit more elegant than what FlyingFrog proposed ;-)

-bash-3.2$ perl
my $string = 'ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC';
my @matches = $string =~ /(ATG\w+?TA[AG])/g;
use Data::Dumper;
print Dumper \@matches;
$VAR1 = [
          'ATGCCCCCCCCCCCCCTAG',
          'ATGAAAAAAAAAATAA',
          'ATGAAAAATAG'
        ];
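
To see why the character class behaves differently from the alternation in the question, the two can be compared side by side. This is a rough sketch without the lookahead; @alternation and @class are names made up for the comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same test sequence as in the question.
my $seq = 'ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC';

# Alternation: at each ATG the engine tries the whole first branch,
# ATG\w+?TAG, and only tries the TAA branch if that one fails.
my @alternation = $seq =~ /(ATG\w+?TAG|ATG\w+?TAA)/g;

# Character class: one branch that stops at whichever of TAA or TAG
# is reached first.
my @class = $seq =~ /(ATG\w+?TA[AG])/g;

print "alternation: $_\n" for @alternation;
print "class:       $_\n" for @class;
```

The alternation runs past the TAA in the second stretch because a TAG can still be found further on, so it returns two matches where the character class returns three.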


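Answer 3:

Your code tries the alternatives in order: at each ATG it first looks for the nearest TAG, and only falls back to TAA if no TAG follows. That is why a TAA that comes before the next TAG is skipped over. If you removed all the TAGs from your sequence, you would find the stretches that end in TAA. With two capture groups (one for ATG...TAG and one for ATG...TAA) you get both candidates for each ATG, but only at positions where both lookaheads match: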

#!/usr/bin/perl
use warnings;
use strict; 

my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');

while($file =~ /(?=(ATG\w+?TAG))(?=(ATG\w+?TAA))/g){ # makes two capture groups 
    print "$1\n";
    print "$2\n";           
} 

Output:

ATGCCCCCCCCCCCCCTAG
ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAA
ATGAAAAAAAAAATAAATGAAAAATAG
ATGAAAAAAAAAATAA

---- OR: ----

#!/usr/bin/perl
use warnings;
use strict; 

my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');

while($file =~ /(?=(ATG\w+?TA[AG]))/g){ 
    print "$1\n";
} 

Output:

ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAA
ATGAAAAATAG

Depending on what exactly you're after...
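
One case where the lookahead does make a difference is a sequence with a second ATG inside a candidate. The sequence below is a made-up example, not taken from the question, and the array names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sequence: a second ATG sits inside the first candidate.
my $seq = 'ATGCATGCCTAG';

# Without lookahead each match consumes its text, so the inner ATG
# is never used as a starting point.
my @consuming = $seq =~ /(ATG\w+?TA[AG])/g;

# With lookahead the match itself is zero-length, so the search resumes
# one character further on and overlapping candidates are reported too.
my @overlapping = $seq =~ /(?=(ATG\w+?TA[AG]))/g;

print "consuming:   $_\n" for @consuming;    # ATGCATGCCTAG
print "overlapping: $_\n" for @overlapping;  # ATGCATGCCTAG, ATGCCTAG
```

For the sequence in the question both forms give the same three results, because no ATG occurs inside another match there.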