Extract multiple columns from a txt file with Perl

Posted 2019-02-21 05:14

Question:

I have a txt file like this:

#Genera columnA columnB columnC columnD columnN
x1       1       3       7      0.9      2
x2       5       3       13     7        5
x3       0.1     0.8     7      1        0.4

and I want to extract a given set of columns. Suppose we want columnA, columnC and columnN (the input could be a matrix with 1, 2, 20, 100 or more columns). This is what I want to print (this example has just 3 columns, but there could be more):

#Genera columnA columnC columnN
    x1   1       7       2
    x2   5       13      5
    x3   0.1     7       0.4

I have tried:

#!/usr/bin/perl
use strict;
use warnings;


my @wanted_fields = qw/columnA columnC columnN/;

open DATA, '<', "columns.txt" or die "cant open file\n";


my @datain = <DATA>;
close DATA;

my (@unit_name, $names, @lines, @conteo, @match_names, @columnas);

foreach (@datain){
    if ($_=~ m/^$/g)            {   next;           }
    elsif ($_=~ m/#Genera/g)    {   $names= $_;     }
    else                        {   push @lines, $_ }
}


@unit_name = split (/\t/, $names);
shift @unit_name;
my $count =0;

    foreach (@wanted_fields){
        my $unit_wanted =$_;
        chomp $unit_wanted;
        foreach (@unit_name){
            if ($_ =~ m/$unit_wanted/g){
                $count++;
                 push (@conteo, $count);
                 push (@match_names, $_);
                }
        }
    }


    foreach (@lines){
        chomp;
        @columnas = split (/\t/, $_);
            #push @xx, $columnas[0][3];

    }

I used the count to determine which column to extract, but the count does not give the column's position: 2 does not correspond to columnC and 3 does not correspond to columnN. Is there a simple way to select any given set of columns? In this case I want just 3, but depending on the case it could be 1, 2, 5, 10, 100 or more columns.

Thanks

Answer 1:

You can simplify this by using hash slices.

#!/usr/bin/env perl
use strict;
use warnings;

my @wanted = ( '#Genera' , qw (  columnA columnC columnN ));

open my $input, '<', "file.txt" or die $!;

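# split ' ' discards leading and trailing whitespace, including the newline
my @header = split ' ', <$input>;

# Parenthesise join so that "\n" is printed after the row, not joined into it
print join( "\t", @wanted ), "\n";
while ( <$input> ) {
   my %row;
   @row{@header} = split;    # hash slice: header name => value on this line
   print join( "\t", @row{@wanted} ), "\n";
}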

Which outputs:

#Genera columnA columnC columnN
x1  1   7   2
x2  5   13  5
x3  0.1 7   0.4

If you want to exactly match your indentation then add sprintf to the mix:

E.g.:

# Parenthesise join so that "\n" is not padded by sprintf or joined into the row
print join( "\t", map { sprintf "%8s", $_ } @wanted ), "\n";
while ( <$input> ) {
   my %row;
   @row{@header} = split;
   print join( "\t", map { sprintf "%8s", $_ } @row{@wanted} ), "\n";
}

Which then gives:

 #Genera     columnA     columnC     columnN           
      x1           1           7           2           
      x2           5          13           5           
      x3         0.1           7         0.4    
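
One caveat with the hash-slice approach: a name in @wanted that is misspelled or absent from the header silently produces an undefined value (and an "uninitialized value" warning) on every row. The sketch below shows one possible way to check the names up front. The helper name select_columns and the in-memory sample lines are assumptions made for illustration; they are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes an array ref of wanted column names and the
# lines of the file (header first); returns the selected columns of each
# line as tab-joined strings, starting with the header row.
sub select_columns {
    my ( $wanted, @lines ) = @_;

    my @header = split ' ', shift @lines;

    # Fail early if a requested column does not exist in the header.
    my %known   = map { $_ => 1 } @header;
    my @missing = grep { !$known{$_} } @$wanted;
    die "Unknown column(s): @missing\n" if @missing;

    my @out = ( join "\t", @$wanted );
    for my $line (@lines) {
        next unless $line =~ /\S/;           # skip blank lines
        my %row;
        @row{@header} = split ' ', $line;    # hash slice, as in the answer
        push @out, join "\t", @row{@$wanted};
    }
    return @out;
}

# Sample data copied from the question.
my @sample = (
    "#Genera columnA columnB columnC columnD columnN\n",
    "x1       1       3       7      0.9      2\n",
    "x2       5       3       13     7        5\n",
);

print "$_\n"
    for select_columns( [ '#Genera', qw( columnA columnC columnN ) ], @sample );
```

With a name such as columnZ in the wanted list, this version dies with a message naming the unknown column instead of printing empty fields.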


Answer 2:

This program does as you ask. It expects the path to the input file as a parameter on the command line, which can then be read using the empty "diamond operator" <> without explicitly opening it.

Each non-blank line of the file is split into fields, and the header line is identified by its first field starting with a hash symbol #.

A call to map converts the @wanted_fields array into a list of indexes into @fields where those column headers appear, and stores it in the array @idx (with index 0 prepended so that the name column is kept).

This array is then used to slice the wanted columns from @fields for every line of input. The fields are printed, separated by tabs.

use strict;
use warnings 'all';

use List::Util 'first';

my @wanted_fields = qw/ columnA columnC columnN /;

my @idx;

while ( <> ) {
    next unless /\S/;

    my @fields = split;

    if ( $fields[0] =~ /^#/ ) {

        @idx = ( 0, map {
            my $wanted = $_;
            first { $fields[$_] eq $wanted } 0 .. $#fields;
        } @wanted_fields );
    }

    print join( "\t", @fields[@idx] ), "\n" if @idx;
}

Output:

#Genera columnA columnC columnN
x1  1   7   2
x2  5   13  5
x3  0.1 7   0.4


Answer 3:

There are command line switches that are used for this kind of application:

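perl -lnae 'print join "\t", @F[0,1,3,5]' file.txt

The -a switch (used together with -n) splits each line on whitespace into the array @F. Indexes are zero-based, so @F[0,1,3,5] is an array slice holding the name column (element 0) followed by columnA, columnC and columnN (elements 1, 3 and 5).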

The downside of this, of course, is that you have to use the column numbers instead of the names.
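
If the numbers are awkward to work out by hand, they can be derived from the header line. The following is only a sketch: column_indexes is a hypothetical helper name, and the header and data line are copied from the question rather than read from a file. Names that are absent from the header come back as undef in this version.

```perl
use strict;
use warnings;

# Hypothetical helper: map wanted column names to their 0-based positions
# in a whitespace-separated header line.
sub column_indexes {
    my ( $header_line, @wanted ) = @_;

    my @names = split ' ', $header_line;

    my %pos;
    @pos{@names} = 0 .. $#names;    # name => position

    return @pos{@wanted};           # positions, in the order requested
}

my @idx = column_indexes(
    "#Genera columnA columnB columnC columnD columnN",
    qw( columnA columnC columnN ),
);
print "@idx\n";    # prints "1 3 5"

# The same positions can then slice a split data line, much like @F.
my @F = split ' ', "x1       1       3       7      0.9      2";
print join( "\t", @F[ 0, @idx ] ), "\n";
```

For the header in the question this yields 1, 3 and 5, which are the numbers to put inside the @F[...] slice.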