I have the next data:
miRNA17 70 105 dvex699824 12 233
miRNA17 21 60 dvex699824 42 20
miRNA17 55 89 dvex699824 6 40
miRNA18 58 85 dvex701176 119 92
miRNA17 66 105 dvex703815 35 75
miRNA17 31 71 dvex703815 43 83
miRNA17 39 79 dvex703815 43 83
miRNA2 28 56 dvex731981 313 286
miRNA17 10 70 dvex735428 142 203
miRNA17 29 91 dvex735428 213 152
miRNA17 66 105 dvex735668 163 125
The question is: If I have this 6 columns, I need to group and print in accordance to this rules:
the same miRNA## \t regardless \t regardless \t The same dvex#### \t Take the Lower \t Take the highest
For example this is the possible output:
miRNA17 21 105 dvex699824 6 233
miRNA18 58 85 dvex701176 119 92
miRNA17 31 105 dvex703815 35 83
miRNA2 28 56 dvex731981 313 286
miRNA17 10 105 dvex735428 142 203
What is the possible way to resolve this problem via Hashes-keys as arrays?
Perl script:
use strict;
# Not shown... Parse the data file, stuff into an array of arrays.
my @data = (
[ 'miRNA17', 70, 105, 'dvex699824', 12, 233 ],
[ 'miRNA17', 21, 60, 'dvex699824', 42, 20 ],
[ 'miRNA17', 55, 89, 'dvex699824', 6, 40 ],
[ 'miRNA18', 58, 85, 'dvex701176', 119, 92 ],
[ 'miRNA17', 66, 105, 'dvex703815', 35, 75 ],
[ 'miRNA17', 31, 71, 'dvex703815', 43, 83 ],
[ 'miRNA17', 39, 79, 'dvex703815', 43, 83 ],
[ 'miRNA2', 28, 56, 'dvex731981', 313, 286 ],
[ 'miRNA17', 10, 70, 'dvex735428', 142, 203 ],
[ 'miRNA17', 29, 91, 'dvex735428', 213, 152 ],
[ 'miRNA17', 66, 105, 'dvex735668', 163, 125 ]
);
my %results;
foreach my $record (@data) {
my ($mirna, $col2, $col3, $dvex, $col5, $col6) = @$record;
$results{$mirna}{$dvex}{col2} = $col2; # don't care.
$results{$mirna}{$dvex}{col3} = $col3; # don't care.
$results{$mirna}{$dvex}{col5} = $col5
if not $results{$mirna}{$dvex}{col5} or $results{$mirna}{$dvex}{col5} > $col5;
$results{$mirna}{$dvex}{col6} = $col6
if not $results{$mirna}{$dvex}{col6} or $results{$mirna}{$dvex}{col6} < $col6;
}
foreach my $mirna (keys %results) {
foreach my $dvex (sort keys %{$results{$mirna}}) {
printf "%-8s %5d %5d %-10s %3d %3d\n",
$mirna, $results{$mirna}{$dvex}{col2}, $results{$mirna}{$dvex}{col3},
$dvex, $results{$mirna}{$dvex}{col5}, $results{$mirna}{$dvex}{col6};
}
}
1;
Output:
miRNA2 28 56 dvex731981 313 286
miRNA17 55 89 dvex699824 6 233
miRNA17 39 79 dvex703815 35 83
miRNA17 29 91 dvex735428 142 203
miRNA17 66 105 dvex735668 163 125
miRNA18 58 85 dvex701176 119 92
It is much more efficient to process files sequentially, without loading everything into a big array first, whenever possible. Here is what such a solution could look like:
my @output_line = split / /, <IN_FILE>;
while (<IN_FILE>)
{
my @current_line = split / /, $_;
if ($current_line[0] ne $output_line[0])
{
printf OUT_FILE "%-8s %5d %5d %-10s %3d %3d\n", @output_line;
@output_line = @current_line;
}
else
{
$output_line[1] = $current_line[1] if ($current_line[1] < $output_line[1]);
$output_line[2] = $current_line[2] if ($current_line[2] > $output_line[2]);
$output_line[4] = $current_line[4] if ($current_line[4] < $output_line[4]);
$output_line[5] = $current_line[5] if ($current_line[5] > $output_line[5]);
}
}
printf OUT_FILE "%-8s %5d %5d %-10s %3d %3d\n", @output_line;
Caveat: your question stated that the output lines should have the "same dvex####". However, your sample output didn't show this. Thus I ignored that requirement. However, you could easily introduce that requirement simply by putting another condition in the if
statement.
Second Caveat: this approach also requires lines that will be grouped to be next to each other, as they were in your sample data.
This is a simple script that will produce the output you wanted, though it does more than your requirements state as it also checks min/max for columns 2 and 3.
I am using List::Util to get the min/max values, which is purely convenience. The module is core since v5.7.3, so it should not present a problem. Using Text::CSV is prudent, but might not be required, depending on your data. Assuming no non-tab whitespace in your columns, one can get away with using split
, which will remove the module dependency.
use strict;
use warnings;
use Text::CSV;
use List::Util qw(min max);
my $csv = Text::CSV->new({
sep_char => "\t",
eol => $/, # required for $csv->print
binary => 1,});
my %data;
my @order;
# *DATA and *STDOUT represent file handles, and can be replaced with
# any other file handle as you require. DATA is used here for simplicity.
#
while (my $row = $csv->getline(*DATA)) {
my ($mir, $dv) = @{$row}[0,3];
my $field = "$mir/$dv";
unless (defined $data{$field}) { # new fields are stored as-is
push @order, $field; # preserving original order of input
$data{$field} = $row;
next;
}
$data{$field}[1] = min($data{$field}[1], $row->[1]);
$data{$field}[2] = max($data{$field}[2], $row->[2]);
$data{$field}[4] = min($data{$field}[4], $row->[4]);
$data{$field}[5] = max($data{$field}[5], $row->[5]);
}
for my $field (@order) {
$csv->print(*STDOUT, $data{$field});
}
__DATA__
miRNA17 70 105 dvex699824 12 233
miRNA17 21 60 dvex699824 42 20
miRNA17 55 89 dvex699824 6 40
miRNA18 58 85 dvex701176 119 92
miRNA17 66 105 dvex703815 35 75
miRNA17 31 71 dvex703815 43 83
miRNA17 39 79 dvex703815 43 83
miRNA2 28 56 dvex731981 313 286
miRNA17 10 70 dvex735428 142 203
miRNA17 29 91 dvex735428 213 152
miRNA17 66 105 dvex735668 163 125
Output:
miRNA17 21 105 dvex699824 6 233
miRNA18 58 85 dvex701176 119 92
miRNA17 31 105 dvex703815 35 83
miRNA2 28 56 dvex731981 313 286
miRNA17 10 91 dvex735428 142 203
miRNA17 66 105 dvex735668 163 125
Note that the output does not match yours, because you have failed to distinguish between the dvex numbers for the last line of your sample input.