I have 2 .vcf files with genomic data and I want to remove lines in the 1st file that are also present in the second file. The code I have so far it seems to iterate only one time, removing the first hit and then stops the search. Any help would be very appreciated since I can not figure out where the problem is. Sorry for any mis-code...
I opted for arrays of arrays instead of hashes because the initial order of the file must be maintained, and I think that this is better achieved with arrays...
Code:
#!/usr/bin/perl
use strict;
use warnings;
## bring files to program
MAIN: {
my ($chrC, $posC, $junkC, $baserefC, $allrestC);
my (@ref_arrayH, @ref_arrayC);
my ($chrH, $posH, $baserefH);
my $entriesH;
# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
if (defined) {
chomp;
if (/^#/) { next; }
if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
my @line_arrayH = split(/\t/, $_);
push (@ref_arrayH, \@line_arrayH); # the "reference" of each line_array is now in each element of @ref_array
$entriesH = scalar @ref_arrayH; # count the number of entries in the het file
}
}
}
close HET;
# print $entriesH,"\n";
open (CNS, $cns) or die "Could not open file $cns - $!";
foreach my $refH ( @ref_arrayH ) {
$chrH = $refH -> [0];
$posH = $refH -> [1];
$baserefH = $refH -> [3];
foreach my $line (<CNS>) {
chomp $line;
if ($line =~ /^#/) { next; }
if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
else { print "$line\n"; }
}
}
}
# close CNS;
}
CNS file:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
HET file:
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
I would like the output to be like this
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
but is giving me this instead:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
Why is this nested loop not working properly? If I want to keep this structure of array-of-arrays, why is only doing the iteration the first time?
Would it be better to change the foreach loop
foreach my $refH ( @ref_arrayH ) {
with a for loop
for (my $i = 0; $i <= $entriesH; $i++) {
?
If you want to match something other than entire lines, extract the field(s) and use it (or their combination) to key into the
%seen
hash.This will use memory in proportion to the number of lines in the
HET
file.To reduce memory usage, you can tie
%seen
to a DBM_File on disk.If you are concerned about memory usage you can read both file one line at a time while doing the comparison. The following is an alternative approach.
Note: Because of the way filehandle works we have to reset connection every time we are to read from the file in the second nested loop.