I am a beginner with Perl programming. I am trying to get the gene length from a text file. The file contains the gene name (column 10), the start site (column 6) and the end site (column 7). The length is the difference between columns 7 and 6. My problem is how to match each gene name (from column 10) with the corresponding length derived from columns 6 and 7. Thank you very much!
open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");
while ($a = <IN>) {
@data = split (/\t/, $a);
$list {$data[10]}++;
$genelength {$data[7] - $data[6]};
}
foreach $sub (keys %list){
$gene = join ($sub, $genelength);
print "$gene\n";
}
close (IN);
close (OUT);
I'm not sure about this as I haven't seen your data. But I think you're making this far harder than necessary. I think that everything you need for each gene is in a single line of the input file, so you can process the file a line at a time and not use any extra variables. Something like this:
open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");
while ($a = <IN>) {
@data = split (/\t/, $a);
print OUT "Gene: $data[10] / Length: ", $data[7] - $data[6], "\n";
}
But there are some improvements we can make. First, we'll stop using $a (which is a special variable used by sort and shouldn't be used in random code) and switch to $_ instead. At the same time we'll add use strict and use warnings and ensure that all of our variables are declared.
use strict;
use warnings;
open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");
while (<IN>) { # This puts the line into $_
my @data = split (/\t/); # split uses $_ by default
print OUT "Gene: $data[10] / Length: ", $data[7] - $data[6], "\n";
}
Next we'll remove the unnecessary parentheses on the split() call and use a list slice to just get the values you want and store them in individual variables.
use strict;
use warnings;
open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");
while (<IN>) { # This puts the line into $_
my ($start, $end, $gene) = (split /\t/)[6, 7, 10]; # split uses $_ by default
print OUT "Gene: $gene / Length: ", $end - $start, "\n";
}
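To see what the list slice does in isolation, here is a small sketch that runs on one made-up line instead of the real file. The sub name extract_fields and the sample values are illustrative assumptions, not taken from Alu.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: pull start, end and gene name out of one
# tab-separated line, using the same indexes as the code above.
sub extract_fields {
    my ($line) = @_;
    chomp $line;    # drop the newline so the last field stays clean
    # (split ...)[6, 7, 10] keeps only elements 6, 7 and 10 of the list
    my ($start, $end, $gene) = (split /\t/, $line)[6, 7, 10];
    return ($start, $end, $gene);
}

# Made-up sample line with eleven tab-separated fields (indexes 0 to 10)
my $line = join("\t", 'f0', 'f1', 'f2', 'f3', 'f4', 'f5', 100, 412, 'f8', 'f9', 'AluY') . "\n";

my ($start, $end, $gene) = extract_fields($line);
print "Gene: $gene / Length: ", $end - $start, "\n";
```

The chomp is only there because, in this sample, the gene name is the last field on the line; without it the name would carry the trailing newline.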
Next, we'll remove the explicit filenames. Instead, we'll read data from STDIN and write it to STDOUT. This is a common Unix/Linux approach called an I/O filter. It will make your program more flexible (and, as a bonus, easier to write).
use strict;
use warnings;
while (<>) { # Empty <> reads the files named on the command line, or STDIN if there are none
my ($start, $end, $gene) = (split /\t/)[6, 7, 10];
# print to STDOUT
print "Gene: $gene / Length: ", $end - $start, "\n";
}
To use this program, we use an operating system feature called I/O redirection. If the program is called filter_genes, we would call it like this:
$ ./filter_genes < Alu.txt > Alu_subfamlength3.csv
And if the names of your files change in the future, you don't need to change your program, just the command line that calls it.
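A side benefit of the filter style is that the loop no longer cares where its input comes from. As a rough sketch (the sub name print_lengths and the sample data are my own assumptions), the same logic can take any pair of filehandles, which lets you try it on an in-memory string before pointing it at a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read tab-separated lines from $in and write
# one "Gene / Length" line per input line to $out.
sub print_lengths {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my ($start, $end, $gene) = (split /\t/, $line)[6, 7, 10];
        print {$out} "Gene: $gene / Length: ", $end - $start, "\n";
    }
    return;
}

# Made-up input: two lines, eleven fields each (indexes 0 to 10)
my $sample = join("\t", (0) x 6, 100, 412, 0, 0, 'AluY') . "\n"
           . join("\t", (0) x 6, 500, 790, 0, 0, 'AluSx') . "\n";

# Open a read handle on the string instead of a file
open(my $in, '<', \$sample) or die "Cannot open in-memory handle: $!";
print_lengths($in, \*STDOUT);
close($in);
```

In the real filter you would call print_lengths(\*STDIN, \*STDOUT) and keep using the shell redirection shown above.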
I assume your input data is tab-delimited and that you want an output CSV file containing each gene name and its corresponding gene length.
Expected Output
genename1,12
genename2,20
genename3,8
Below is the code I wrote with those assumptions:
use strict;
use warnings;
my $input_file;
my $output_file;
my %hash_gene;
open ($input_file, "<", "testdata.txt") or die "Can not open file [testdata.txt]: $!";
open ($output_file, ">", "outdata.txt") or die "Can not open file [outdata.txt]: $!";
while (<$input_file>) {
chomp;
my @data = split (/\t/, $_);
$hash_gene{$data[10]} = $data[7] - $data[6];
}
foreach my $sub (keys %hash_gene){
print $output_file "$sub,$hash_gene{$sub}\n";
}
close ($input_file);
close ($output_file);
Notes
- I modified the file names; change them as needed.
- Array indexes are 0-based. I assume you took that into account when you gave the column numbers (so that the first column is column 0).
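One caveat with storing the lengths in a plain hash: if a gene name appears on more than one line, each new line overwrites the previous length, so only the last one is written out. I haven't seen the data, so this may not matter, but if names do repeat, a sketch along these lines keeps every length per name. The sub name collect_lengths, the index variables and the sample rows are assumptions for illustration.

```perl
use strict;
use warnings;

# 0-based indexes, as in the code above. If "column 6" means the
# sixth column counting from 1, these would be 5, 6 and 9 instead.
my ($START_IDX, $END_IDX, $GENE_IDX) = (6, 7, 10);

# Hypothetical helper: build a hash mapping gene name => [lengths]
sub collect_lengths {
    my ($in) = @_;
    my %lengths;
    while (my $line = <$in>) {
        chomp $line;
        my @data = split /\t/, $line;
        # push onto an array ref so repeated names keep all values
        push @{ $lengths{ $data[$GENE_IDX] } }, $data[$END_IDX] - $data[$START_IDX];
    }
    return \%lengths;
}

# Made-up rows: AluY appears twice
my $sample = join("\t", (0) x 6, 100, 412, 0, 0, 'AluY') . "\n"
           . join("\t", (0) x 6, 500, 790, 0, 0, 'AluSx') . "\n"
           . join("\t", (0) x 6, 20, 320, 0, 0, 'AluY') . "\n";

open(my $in, '<', \$sample) or die "Cannot open in-memory handle: $!";
my $lengths = collect_lengths($in);
close($in);

# One CSV row per gene name, followed by all of its lengths
foreach my $gene (sort keys %$lengths) {
    print join(',', $gene, @{ $lengths->{$gene} }), "\n";
}
```

Whether you want every length, the total, or an average per name depends on what the lengths are for; the array per name leaves all three options open.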