How to compare and merge multiple files?

2019-09-22 03:53发布

问题:

reference file

chr1    288598  288656

chr1    779518  779576

chr2    2569592 2569660

chr3    5018399 5018464

chr4    5182842 5182882

file1

chr1    288598  288656 12

chr1    779518  779576 14

chr2    2569592 2569660 26

chr3    5018399 5018464 27

chr4    5182842 5182882 37

file2

chr1    288598  288656 35

chr2    2569592 2569660 348

chr3    5018399 5018464 4326

chr4    5182842 5182882 68

I have six similar files excluding the reference file.

Here first three fields are similar to the reference file. Therefore, I would like export only 4th column from all 6 files and put into the reference file to make a new output. which should be equivalent to the reference files. Where they don't match put zero.

desired output

chr1    288598  288656 23 35 57 68 769 68

chr1    779518  779576 23 0 57 68 768 0

chr2    2569592 2569660 23 35 0 68 79 0

chr3    5018399 5018464 0 36 0 68 769 0

chr4    5182842 5182882 23 0 0 0 0 0

Note: the reference file length is about 2000 ans the other files are not always in same the length (about 500, 400, 200, 100 etc). That is why need zero added.

I tried the answer from this question

paste ref.file file1 file2 file3 file4 file5 file6 |  awk '{OFS="\t";print $1,$2,$3,$7,$11,$15,$19,$23,$27}' > final.common.out

but seems it's not working — some values are missed. And I can't understand how to add zero where there is no match.

回答1:

I think something like this should do what you want. We use a hash to gather the 'reference' file and turn it into a set of keys with an empty array.

Then we iterate on the other files, extracting '3 values' as key, and the last value as an actual value.

And then we compare the two, updating the 'reference' hash with either the value or zero. The caveat here - any lines not in your reference file (or duplicates) will just disappear.

#!/usr/bin/perl

use strict;
use warnings;
use autodie;


#read 'reference file' into a hash:
my %ref;
open( my $ref_fh, "<", "reference_file" );
while (<$ref_fh>) {
    my ( $first, $second, $third ) = split;

    #turn the first three fields into space delimited key.
    $ref{"$first $second $third"} = ();
}

#open each of the files.
my @files = qw ( file1 file2 file3 file4 file5 file6 );
foreach my $input (@files) {
    open( my $input_fh, "<", $input );
    my %current;
    while (<$input_fh>) {

        #line by line, extract 'first 3 fields' to use as a key.
        #then 'value' which we store.
        my ( $first, $second, $third, $value ) = split;
        $current{"$first $second $third"} = $value;
    }

    #refer to 'reference file' and insert matching value or zero into
    #the array.
    foreach my $key ( keys %ref ) {
        push( @{ $ref{$key} }, $current{$key} ? $current{$key} : 0 );
    }
}

foreach my $key ( keys %ref ) {
    print join( " ", $key, @{ $ref{$key} } );
}