I want to store DNA sequences of length n in the following data structure. Each hash can contain the keys C, G, A, and T, whose values are themselves hashes of the same kind: each has the keys C, G, A, and T, again with hashes as values.
This structure repeats for n levels of hashes. However, the hashes at the last level hold integer values instead, each giving the count of the sequence spelled out by the keys from level 1 to level n.
Take the data ('CG', 'CA', 'TT', 'CG'), in which the sequences CG, CA, and TT occur twice, once, and once, respectively. For this data, the depth is 2.
This would produce a hash:
%root = ( 'C' => { 'G' => 2, 'A' => 1}, 'T' => {'T' => 1 })
How would one create this hash from the data?
What you need is a function get_node($tree, 'C', 'G') that returns a reference to the hash element for "CG". Then you can just increment the referenced scalar.
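sub get_node {
    my $p = \shift;                  # reference to the caller's $tree itself, not a copy
    $p = \( ($$p)->{$_} ) for @_;    # descend one key at a time, autovivifying hashes
    return $p;                       # reference to the final element (the count)
}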
my @seqs = qw( CG CA TT CG );
my $tree;
++${ get_node($tree, split //) } for @seqs;
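If the one-liner is hard to follow, here is a rough sketch of the same walk written out step by step. The name get_node_verbose is made up for illustration, and unlike get_node it assumes the caller passes a reference to $tree explicitly rather than relying on @_ aliasing.

```perl
use strict;
use warnings;

# Hypothetical, longhand variant of get_node (name is an assumption).
# Takes a reference to the tree variable plus the list of keys to follow.
sub get_node_verbose {
    my ($tree_ref, @keys) = @_;
    my $p = $tree_ref;               # $p always points at the current slot
    for my $key (@keys) {
        # Taking a reference to the element autovivifies $$p into a hash
        # reference if it is still undefined, and creates the element.
        $p = \( $$p->{$key} );
    }
    return $p;                       # reference to the innermost element
}

my $tree;
for my $seq (qw( CG CA TT CG )) {
    my $count_ref = get_node_verbose(\$tree, split //, $seq);
    $$count_ref++;                   # undef becomes 1 on the first increment
}
```

The key point is that $p is a reference to a slot, not to a hash, so the final slot can be incremented in place whether or not it existed before.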
The thing is, this function already exists as Data::Diver's DiveRef.
use Data::Diver qw( DiveRef );
my @seqs = qw( CG CA TT CG );
my $tree = {};
++${ DiveRef($tree, split //) } for @seqs;
In both cases,
use Data::Dumper qw( Dumper );
print(Dumper($tree));
prints
$VAR1 = {
'T' => {
'T' => 1
},
'C' => {
'A' => 1,
'G' => 2
}
};
The following should work:
use strict;
use warnings;
use Data::Dumper;
my %data;
my @sequences = qw(CG CG CA TT);
foreach my $sequence (@sequences) {
    my @vars = split(//, $sequence);
    my $last = pop @vars;    # the final letter keys the count
    my $node = \%data;       # start at the root, so length-1 sequences work too
    # Descend through every letter except the last, creating levels as needed
    foreach my $letter (@vars) {
        $node->{$letter} = {} if !exists $node->{$letter};
        $node = $node->{$letter};
    }
    $node->{$last}++;
}
print Dumper(\%data);
Produces:
$VAR1 = {
'T' => {
'T' => 1
},
'C' => {
'A' => 1,
'G' => 2
}
};