Perl: How to join two columns of a text file, in w

Posted 2020-04-20 21:00

Question:

I am a beginner with Perl programming. The problem I am working on right now is how to get the gene length from a text file. The text file contains the gene name (column 10), start site (column 6), and end site (column 7). The length can be derived from the difference between columns 6 and 7. My problem is how to match each gene name (from column 10) with the corresponding length derived from columns 6 and 7. Thank you very much!

open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");

while ($a = <IN>) {
    @data = split (/\t/, $a);
    $list {$data[10]}++;
    $genelength {$data[7] - $data[6]};
}

foreach $sub (keys %list){
    $gene = join ($sub, $genelength);

    print "$gene\n";
}
close (IN);
close (OUT);

Answer 1:

I'm not sure about this as I haven't seen your data. But I think you're making this far harder than necessary. I think that everything you need for each gene is in a single line of the input file, so you can process the file a line at a time and not use any extra variables. Something like this:

open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");

while ($a = <IN>) {
    @data = split (/\t/, $a);
    print OUT "Gene: $data[10] / Length: ", $data[7] - $data[6], "\n";
}

But there are some improvements we can make. First, we'll stop using $a (which, together with $b, is a special variable used by sort and shouldn't be used in random code) and switch to $_ instead. At the same time we'll add use strict and use warnings and ensure that all of our variables are declared.

use strict;
use warnings;

open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");

while (<IN>) { # This puts the line into $_
    my @data = split (/\t/); # split uses $_ by default
    print OUT "Gene: $data[10] / Length: ", $data[7] - $data[6], "\n";
}
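
To see why $a is best left alone: sort places the two elements being compared in the package variables $a and $b. A minimal sketch, with made-up lengths used only for illustration:

```perl
use strict;
use warnings;

# Hypothetical gene lengths, used only for illustration
my @lengths = (20, 8, 12);

# sort sets the package variables $a and $b for each comparison,
# which is why they need no "my" even under "use strict"
my @sorted = sort { $a <=> $b } @lengths;

print "@sorted\n";   # prints "8 12 20"
```

Assigning to $a elsewhere in the same package changes the variable that sort blocks rely on, so a differently named variable (or $_) is the safer choice.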

Next we'll remove the unnecessary parentheses on the split() call and use a list slice to just get the values you want and store them in individual variables.

use strict;
use warnings;

open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");

while (<IN>) { # This puts the line into $_
    my ($start, $end, $gene) = (split /\t/)[6, 7, 10]; # split uses $_ by default
    print OUT "Gene: $gene / Length: ", $end - $start, "\n";
}
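
The list slice can be tried on its own without the input file. This is only a sketch: the helper name parse_line and the sample line are assumptions, since the real data isn't shown. It also chomps the line, which matters if the gene name happens to be the last column.

```perl
use strict;
use warnings;

# Hypothetical helper: pull start, end and gene name out of one tab-separated line
sub parse_line {
    my ($line) = @_;
    chomp $line;   # drop the newline so it can't end up in the last field
    # (split ...)[6, 7, 10] is a list slice: it keeps only those three fields
    my ($start, $end, $gene) = (split /\t/, $line)[6, 7, 10];
    return ($gene, $end - $start);
}

# Made-up sample line with 11 fields (indexes 0 to 10)
my $line = join("\t", qw(f0 f1 f2 f3 f4 f5 100 112 f8 f9 AluY)) . "\n";
my ($gene, $length) = parse_line($line);
print "Gene: $gene / Length: $length\n";   # prints "Gene: AluY / Length: 12"
```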

Next, we'll remove the explicit filenames. Instead, we'll read data from STDIN and write it to STDOUT. This is a common Unix/Linux approach called an I/O filter. It will make your program more flexible (and, as a bonus, easier to write).

use strict;
use warnings;

while (<>) { # <> reads from files named on the command line, or STDIN if there are none
    my ($start, $end, $gene) = (split /\t/)[6, 7, 10];
    # print to STDOUT
    print "Gene: $gene / Length: ", $end - $start, "\n";
}

To use this program, we use an operating system feature called I/O redirection. If the program is called filter_genes, we would call it like this:

$ ./filter_genes < Alu.txt > Alu_subfamlength3.csv

And if the names of your files change in the future, you don't need to change your program, just the command line that calls it.
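
The same idea makes the code easy to try without any files at all. As a sketch (the sub name filter_genes and the sample data are assumptions, not part of the original program), the loop can take its input and output filehandles as arguments, and an in-memory filehandle can stand in for STDIN:

```perl
use strict;
use warnings;

# Hypothetical helper: read tab-separated lines from $in, write results to $out
sub filter_genes {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my ($start, $end, $gene) = (split /\t/, $line)[6, 7, 10];
        print {$out} "Gene: $gene / Length: ", $end - $start, "\n";
    }
}

# Made-up input held in a string; opening a reference to a scalar
# gives an in-memory filehandle
my $input = join("\t", qw(f0 f1 f2 f3 f4 f5 100 112 f8 f9 AluY)) . "\n";
open(my $in, '<', \$input) or die "Cannot open in-memory input: $!";

filter_genes($in, \*STDOUT);   # prints "Gene: AluY / Length: 12"
close($in);
```

In the real script the call would be filter_genes(\*STDIN, \*STDOUT), and the redirection on the command line decides which files those are.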



Answer 2:

I assume your input data is tab-delimited and that you want an output CSV file containing each gene name and its corresponding gene length.

Expected Output

genename1,12
genename2,20
genename3,8

Below is the code I wrote based on those assumptions:

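use strict;
use warnings;

my $input_name  = 'testdata.txt';
my $output_name = 'outdata.txt';

my %hash_gene;

# Three-argument open with lexical filehandles; $! holds the OS error message
open(my $input_file,  '<', $input_name)  or die "Cannot open file [$input_name]: $!";
open(my $output_file, '>', $output_name) or die "Cannot open file [$output_name]: $!";

while (<$input_file>) {
    chomp;
    my @data = split(/\t/, $_);

    # A gene name that appears on several lines keeps only its last length
    $hash_gene{$data[10]} = $data[7] - $data[6];
}

# keys returns the gene names in no particular order
foreach my $sub (keys %hash_gene) {
    print $output_file "$sub,$hash_gene{$sub}\n";
}
close($input_file);
close($output_file);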

Notes

  • I changed the file names; change them back as needed
  • Array indexes are 0-based (the first column is column 0). I assume you took that into account when you gave the column numbers
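
If the column numbers in the question are actually counted from 1, every index needs to be one lower (5, 6 and 9 instead of 6, 7 and 10). A small sketch of the difference, using a hypothetical helper named column and made-up data:

```perl
use strict;
use warnings;

# Hypothetical helper: fetch a field by its 1-based column number
sub column {
    my ($fields, $n) = @_;
    return $fields->[$n - 1];
}

my @data = split /\t/, "first\tsecond\tthird";
print column(\@data, 1), "\n";   # prints "first"
print $data[1], "\n";            # prints "second": index 1 is the second column
```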