I am a beginner with Perl programming. The problem I am working on right now is how to get the gene length from a text file. The text file contains the gene name (column 10), the start site (column 6), and the end site (column 7). The length can be derived from the difference between columns 6 and 7. My problem is how to match each gene name (from column 10) with the corresponding length derived from columns 6 and 7. Thank you very much!
open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");
while ($a = <IN>) {
    @data = split (/\t/, $a);
    $list {$data[10]}++;
    $genelength {$data[7] - $data[6]};
}
foreach $sub (keys %list){
    $gene = join ($sub, $genelength);
    print "$gene\n";
}
close (IN);
close (OUT);
I'm not sure about this as I haven't seen your data. But I think you're making this far harder than necessary. I think that everything you need for each gene is in a single line of the input file, so you can process the file a line at a time and not use any extra variables. Something like this:
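A minimal sketch of that line-at-a-time approach, kept close to the question's code. It assumes the zero-based indices 10, 6, and 7 from that code really are the name, start, and end fields, and the comma separator is only a guess based on the .csv filename:

```perl
# Line-at-a-time version in the style of the question's code.
# Assumption: zero-based fields 10, 6 and 7 hold name, start and end.
open (IN, "Alu.txt") or die "Cannot open Alu.txt: $!";
open (OUT, ">Alu_subfamlength3.csv") or die "Cannot open Alu_subfamlength3.csv: $!";
while ($a = <IN>) {
    chomp ($a);                   # drop the trailing newline
    @data = split (/\t/, $a);     # tab-separated fields
    # The name and the length come from the same line, so no hash is needed.
    # The comma separator is an assumption based on the .csv filename.
    print OUT $data[10], ",", $data[7] - $data[6], "\n";
}
close (IN);
close (OUT);
```

Because each line already carries both the name and the two coordinates, the name and its length are matched simply by printing them together.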
But there are some improvements we can make. First, we'll stop using $a (which is a special variable and shouldn't be used in random code) and switch to $_ instead. At the same time we'll add use strict and use warnings and ensure that all of our variables are declared.
Next we'll remove the unnecessary parentheses on the split() call and use a list slice to just get the values you want and store them in individual variables.
Next, we'll remove the explicit filenames. Instead, we'll read data from STDIN and write it to STDOUT. This is a common Unix/Linux approach called an I/O filter. It will make your program more flexible (and, as a bonus, easier to write).
To use this program, we use an operating system feature called I/O redirection. If the program is called filter_genes, we would call it like this:
And if the names of your files change in the future, you don't need to change your program, just the command line that calls it.
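Putting those changes together, the filter might look something like the sketch below. The field indices (zero-based 6, 7, and 10) are carried over from the question's code, and the comma-separated output format is an assumption rather than something the question specifies:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read tab-separated lines from STDIN and write "name,length" to STDOUT.
while (<STDIN>) {
    chomp;
    # List slice: keep only fields 6, 7 and 10 (zero-based, as in the
    # question's code; adjust if your columns differ).
    my ($start, $end, $gene) = (split /\t/)[6, 7, 10];
    # The comma separator is an assumption based on the .csv output name.
    print "$gene,", $end - $start, "\n";
}
```

Assuming the script is saved as filter_genes, it could be run as: perl filter_genes < Alu.txt > Alu_subfamlength3.csv (the file names are the ones from the question).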
I assume your input data is tab-delimited and that you want an output CSV file containing each gene name and its corresponding gene length.
Expected Output
Below is the code I made with those assumptions
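A sketch of what such a script could look like under those assumptions (tab-delimited input, CSV output with the gene name and its length). The file names and the zero-based field indices 6, 7, and 10 are taken from the question's code and may need adjusting. Skipping short lines is an extra assumption about how malformed input should be handled:

```perl
use strict;
use warnings;

# File names taken from the question; change them as needed.
my ($in_file, $out_file) = ('Alu.txt', 'Alu_subfamlength3.csv');

# Three-argument open with lexical filehandles and error checks.
open my $in,  '<', $in_file  or die "Cannot read $in_file: $!";
open my $out, '>', $out_file or die "Cannot write $out_file: $!";

while (my $line = <$in>) {
    chomp $line;
    my @fields = split /\t/, $line;
    next if @fields < 11;    # assumption: skip lines too short to hold field 10
    # Zero-based indices as in the question's code.
    my ($start, $end, $gene) = @fields[6, 7, 10];
    print {$out} join(',', $gene, $end - $start), "\n";
}

close $in;
close $out or die "Cannot close $out_file: $!";
```

Each output row pairs a gene name with the length computed from the same input line, so repeated names simply produce repeated rows.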
Notes