Given these inputs:
my $init_seq = "AAAAAAAAAA" #length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );
I want to generate:
One thousand length-10 tags
Substitution rate for each position in a tag is 0.003
Yielding output like:
AAAAAAAAAA
AATAACAAAA
.....
AAGGAAAAGA # 1000th tags
Is there a compact way to do it in Perl?
I am stuck with the logic of this script as core:
#!/usr/bin/perl
my $init_seq = "AAAAAAAAAA" #length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );
$i = 0;
while ($i < length($init_seq)) {
$roll = int(rand 4) + 1; # $roll is now an integer between 1 and 4
if ($roll == 1) {$base = A;}
elsif ($roll == 2) {$base = T;}
elsif ($roll == 3) {$base = C;}
elsif ($roll == 4) {$base = G;};
print $base;
}
continue {
$i++;
}
As a small optimisation, replace:
$roll = int(rand 4) + 1; # $roll is now an integer between 1 and 4
if ($roll == 1) {$base = A;}
elsif ($roll == 2) {$base = T;}
elsif ($roll == 3) {$base = C;}
elsif ($roll == 4) {$base = G;};
with
$base = $dna[int(rand 4)];
EDIT: Assuming substitution rate is in the range 0.001 to 1.000:
As well as $roll
, generate another (pseudo)random number in the range [1..1000], if it is less than or equal to (1000 * $sub_rate) then perform the substitution, otherwise do nothing (i.e. output 'A').
Be aware that you may introduce subtle bias unless the properties of your random number generator are known.
Not exactly what you are looking for, but I suggest you take a look at BioPerl's Bio::SeqEvolution::DNAPoint module. It does not take mutation rate as a parameter though. Rather, it asks what the lower bound of sequence identity with the original you want.
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqEvolution::Factory;
my $seq = Bio::Seq->new(-seq => 'AAAAAAAAAA', -alphabet => 'dna');
my $evolve = Bio::SeqEvolution::Factory->new (
-rate => 2, # transition/transversion rate
-seq => $seq
-identity => 50 # At least 50% identity with the original
);
my @mutated;
for (1..1000) { push @mutated, $evolve->next_seq }
All 1000 mutated sequences will be stored in the @mutated array, their sequences can be accessed via the seq
method.
In the event of a substitution, you want to exclude the current base from the possibilities:
my @other_bases = grep { $_ ne substr($init_seq, $i, 1) } @dna;
$base = @other_bases[int(rand 3)];
Also please see Mitch Wheat's answer for how to implement the substitution rate.
I don't know if I understand correctly but I'd do something like this (pseudocode):
digits = 'ATCG'
base = 'AAAAAAAAAA'
MAX = 1000
for i = 1 to len(base)
# check if we have to mutate
mutate = 1+rand(MAX) <= rate*MAX
if mutate then
# find current A:0 T:1 C:2 G:3
current = digits.find(base[i])
# get a new position
# but ensure that it is not current
new = (j+1+rand(3)) mod 4
base[i] = digits[new]
end if
end for