I want to find every occurrence of ATG...TAG or ATG...TAA. I have tried the following:
#!/usr/bin/perl
use warnings;
use strict;
my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');
while($file =~ /(?=(ATG\w+?TAG|ATG\w+?TAA))/g){
print "$1\n";
}
which gives:
ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAAATGAAAAATAG
ATGAAAAATAG
I want:
ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAA
ATGAAAAATAG
What am I doing wrong?
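You are actually very close. The alternation tries its first branch, ATG\w+?TAG, to completion before it ever tries ATG\w+?TAA, so the second match runs past the TAA to the next TAG. It also appears from your statement above that you have two captures, when I think you really only want a single one, and I don't think you need the lookahead.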
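To see the overshoot in isolation, here is a minimal sketch that runs both forms of the pattern on just the middle of the question's string. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# The second and third stretch from the question's sequence.
my $seq = 'ATGAAAAAAAAAATAAATGAAAAATAG';

# Two branches: the TAG branch is tried first and succeeds by running past TAA.
my ($with_alternation) = $seq =~ /(ATG\w+?TAG|ATG\w+?TAA)/;

# One branch: the lazy \w+? stops at the first TAA or TAG it reaches.
my ($with_class) = $seq =~ /(ATG\w+?TA[AG])/;

print "$with_alternation\n";   # ATGAAAAAAAAAATAAATGAAAAATAG
print "$with_class\n";         # ATGAAAAAAAAAATAA
```

With a single branch ending in TA[AG], whichever stop codon comes first ends the match, which is what the script that follows relies on.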
#!/usr/bin/perl
use warnings;
use strict;
my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');
while($file =~ /(ATG\w+?TA[AG])/g){
print "$1\n";
}
# output
# ATGCCCCCCCCCCCCCTAG
# ATGAAAAAAAAAATAA
# ATGAAAAATAG
Piece by piece:
ATG matches a literal ATG
\w+? matches one or more word characters, as few as possible (non-greedy)
TA[AG] matches a literal TAA or TAG
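The same breakdown can be kept next to the pattern itself by using the /x modifier, which lets a regex carry whitespace and comments. This is only a sketch: find_stretches is a hypothetical helper name, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every non-overlapping stretch that starts
# with ATG and ends at the first TAA or TAG after it.
sub find_stretches {
    my ($seq) = @_;
    return $seq =~ /
        (           # capture the whole stretch
            ATG     # literal ATG
            \w+?    # one or more word characters, as few as possible
            TA[AG]  # literal TAA or TAG
        )
    /gx;
}

my $seq = 'ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC';
print "$_\n" for find_stretches($seq);
```

Called in list context, as here, the helper returns the three stretches shown in the output above.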
/(ATG\w+?TA[AG])/ works and is a bit more elegant than what FlyingFrog proposed ;-)
-bash-3.2$ perl
my $string = 'ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC';
my @matches = $string =~ /(ATG\w+?TA[AG])/g;
use Data::Dumper;
print Dumper \@matches;
$VAR1 = [
'ATGCCCCCCCCCCCCCTAG',
'ATGAAAAAAAAAATAA',
'ATGAAAAATAG'
];
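Your code will find sequences starting with ATG and ending in TAG or TAA, whichever branch of the alternation matches first, and the TAG branch is listed first. If you removed all the TAGs from your sequence, you would find the stretches that end in TAA. By making two capture groups (one for ATG...TAG and one for ATG...TAA) you will find both kinds of sequence from each ATG. Note that both lookaheads must succeed at the same position, so an ATG with no later TAA (the final ATGAAAAATAG here) is not reported.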
#!/usr/bin/perl
use warnings;
use strict;
my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');
while($file =~ /(?=(ATG\w+?TAG))(?=(ATG\w+?TAA))/g){ # makes two capture groups
print "$1\n";
print "$2\n";
}
Output:
ATGCCCCCCCCCCCCCTAG
ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAA
ATGAAAAAAAAAATAAATGAAAAATAG
ATGAAAAAAAAAATAA
---- OR: ----
#!/usr/bin/perl
use warnings;
use strict;
my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');
while($file =~ /(?=(ATG\w+?TA[AG]))/g){
print "$1\n";
}
Output:
ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAA
ATGAAAAATAG
Depending on what exactly you're after...
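One case where the choice matters is an ATG sitting inside another stretch. The sketch below uses a made-up test string (not from the question) to compare the consuming pattern with the lookahead version.

```perl
use strict;
use warnings;

# Hypothetical test string in which one ATG sits inside another stretch.
my $nested = 'ATGCATGCCTAG';

# Consuming match: the search resumes after the end of each match.
my @consuming = $nested =~ /(ATG\w+?TA[AG])/g;

# Lookahead match: nothing is consumed, so every ATG position is tried.
my @lookahead = $nested =~ /(?=(ATG\w+?TA[AG]))/g;

print "consuming: @consuming\n";   # consuming: ATGCATGCCTAG
print "lookahead: @lookahead\n";   # lookahead: ATGCATGCCTAG ATGCCTAG
```

On the question's own sequence both versions print the same three lines, because none of its stretches contains a second ATG before the stop.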