I have a file, fileA, as shown below:
file A
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
....
And I have a number of files with similar columns in a directory dirB, against which I have to compare fileA.
To do this, I concatenated all the files in dirB with cat into a single file called fileB, and then compared the two files on key columns 1 and 2 as shown below:
awk 'FNR==NR{a[$1,$2]++;next}!a[$1,$2]' fileB fileA
This command uses columns 1 and 2 as the key and prints the rows of fileA whose key does not appear in fileB.
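Written out with comments, the same one-liner reads:

awk '
    # first file (fileB): remember every key built from columns 1 and 2
    FNR==NR { a[$1,$2]++; next }
    # second file (fileA): print rows whose key was never seen in fileB
    !a[$1,$2]
' fileB fileA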
However, fileB becomes too large to handle, in both disk space and memory, when there are a large number of files.
Could someone suggest an alternative that skips the step of concatenating all the files into fileB, so that fileA is compared directly with the files in dirB (see the sketch after the example below)?
A file in dirB looks similar, for example:
chr1 123 aa b c d xxxx abcd
chr1 234 a b c d
chr1 345 aa b c d yyyy defg
chr1 456 a b c d
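One way to drop the cat step is to let awk read the dirB files directly. FNR==NR no longer identifies all of the B files once there is more than one of them, so a FILENAME test can stand in for it. A rough sketch, assuming dirB/* expands to the files to compare against:

awk '
    # any file other than fileA is one of the dirB files: just record its keys
    FILENAME != "fileA" { seen[$1,$2]++; next }
    # fileA (the last argument): print rows whose key was never recorded
    !seen[$1,$2]
' dirB/* fileA

This avoids writing a concatenated fileB to disk, but the seen array still holds one entry per key across all of dirB, so the memory problem remains. Loading fileA into memory instead, as in the idea below, keeps memory proportional to fileA.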
Perhaps something along these lines:
Starting with file A, add each key to an array, with the contents of its row as the value. Then, for each of the B files, delete any elements from the array with matching keys. At the end, any elements remaining are those in A that weren't in any of the B files, so we can just loop through and print them out.
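A minimal sketch of that approach, again assuming the key is columns 1 and 2 and that dirB/* expands to all of the B files:

awk '
    # first file (fileA): remember every row, keyed by columns 1 and 2
    NR==FNR { rows[$1,$2] = $0; next }
    # dirB files: a matching key means the row is not unique to fileA, so drop it
    ($1,$2) in rows { delete rows[$1,$2] }
    # whatever is left never appeared in any dirB file
    END { for (key in rows) print rows[key] }
' fileA dirB/*

Memory now scales with fileA only. Note that the END loop does not preserve fileA's original line order; if that matters, the input order can be recorded in a second array and used when printing.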