How to compare first column of two files but get s

Posted 2019-07-27 11:56

Question:

I have two tab-separated files with two columns each, and I want to compare them on the first column. When a value in the first column appears in both files, I want to write the two second-column values to a new file. Note that IDs in the first column of FILE1 can be duplicated. Basically I have:

FILE1:

TRINITY_DN10001_c0_g1_i1     TRINITY_DN10001_c0_g1_TRINITY_DN10001_c0_g1_i1_g.84091_m.84091
TRINITY_DN100032_c0_g2_i1    TRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.20078_m.20078
TRINITY_DN100032_c0_g2_i1    TRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.42263_m.42263
.....
TRINITY_DN99985_c0_g1_i1     TRINITY_DN99985_c0_g1_TRINITY_DN99985_c0_g1_i1_g.21199_m.21199

FILE2:

TRINITY_DN100007_c0_g1_i1   GO:0001071,GO:0003674
TRINITY_DN100032_c0_g2_i1   GO:0000149,GO:0001775
.....
TRINITY_DN99997_c0_g1_i1    GO:0000166,GO:0001882

And I need this:

TRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.20078_m.20078    GO:0000149,GO:0001775
TRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.42263_m.42263    GO:0000149,GO:0001775
.....

I think this can be done by combining two hash tables in Perl, somewhat similar to this answer.

But I'm quite new to Perl, so I don't know exactly how to do this. I would really appreciate it if someone could help me modify the previous script (or solve this problem in a different way).

Thanks in advance! ☺

Answer 1:

How big are the files? Are they small enough to fit in memory? Are they sorted?

Assuming that one of the files is small enough to fit in memory, you can read that file and hash it: the key is the first column, the value is the second column. Then read through the other file, checking the hash as you go to see whether each key exists, and, if so, print out the two second columns (one of which is the value from the hash).

Assuming we have $file1 and $file2, and that $file1 is small enough, we get something like this:

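open my $fh, '<', $file1 or die "Can't read $file1: $!";
# This slurps in the file, be sure you can fit it all in memory multiple times over!
# chomp strips the trailing newline so it does not end up inside the hash values.
my %file1 = map { chomp; split /\t/, $_, 2 } <$fh>;
close $fh;
open $fh, '<', $file2 or die "Can't read $file2: $!";
while (<$fh>) {
    chomp;
    my ($k, $v) = split /\t/, $_, 2;
    # Test with exists rather than truthiness, so a value such as "0" still matches.
    if (defined $k && exists $file1{$k}) {
        print join("\t", $file1{$k}, $v), "\n";
    }
}
close $fh;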

Assuming the same, but allowing file1 to have duplicates:

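open my $fh, '<', $file1 or die "Can't read $file1: $!";
my %file1;    # first column => array ref of second-column values
while (<$fh>) {
    chomp;    # drop the trailing newline before splitting
    my ($k, $v) = split /\t/, $_, 2;
    push @{$file1{$k}}, $v;
}
close $fh;
open $fh, '<', $file2 or die "Can't read $file2: $!";
while (<$fh>) {
    chomp;
    my ($k, $v) = split /\t/, $_, 2;
    # exists avoids autovivifying an entry for keys that are only in file2
    if (defined $k && exists $file1{$k}) {
        print join("\t", $_, $v), "\n" for @{$file1{$k}};
    }
}
close $fh;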

Note that the output follows the line order of file2; when a key is duplicated in file1, its matching lines always appear in the same order as in file1.
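
To make the duplicate-aware approach easy to try out, here is a minimal, self-contained sketch that runs it on the sample rows from the question. The helper name join_on_first_column and the in-memory arrays standing in for the two files are assumptions made for illustration, not part of the original answer; the sketch also assumes the columns really are separated by a single tab.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): takes two array refs of
# tab-separated lines and returns "file1 value <TAB> file2 value" strings.
sub join_on_first_column {
    my ($file1_lines, $file2_lines) = @_;

    # first column of file1 => list of second-column values, in file1 order
    my %file1;
    for my $line (@$file1_lines) {
        chomp(my $copy = $line);
        my ($k, $v) = split /\t/, $copy, 2;
        next unless defined $v;    # skip lines that have no tab
        push @{ $file1{$k} }, $v;
    }

    my @out;
    for my $line (@$file2_lines) {
        chomp(my $copy = $line);
        my ($k, $v) = split /\t/, $copy, 2;
        next unless defined $v && exists $file1{$k};
        # one output row per file1 entry that shares this key
        push @out, join("\t", $_, $v) for @{ $file1{$k} };
    }
    return @out;
}

# Sample rows copied from the question (the "....." lines are left out).
my @file1 = (
    "TRINITY_DN10001_c0_g1_i1\tTRINITY_DN10001_c0_g1_TRINITY_DN10001_c0_g1_i1_g.84091_m.84091\n",
    "TRINITY_DN100032_c0_g2_i1\tTRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.20078_m.20078\n",
    "TRINITY_DN100032_c0_g2_i1\tTRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.42263_m.42263\n",
    "TRINITY_DN99985_c0_g1_i1\tTRINITY_DN99985_c0_g1_TRINITY_DN99985_c0_g1_i1_g.21199_m.21199\n",
);
my @file2 = (
    "TRINITY_DN100007_c0_g1_i1\tGO:0001071,GO:0003674\n",
    "TRINITY_DN100032_c0_g2_i1\tGO:0000149,GO:0001775\n",
    "TRINITY_DN99997_c0_g1_i1\tGO:0000166,GO:0001882\n",
);

print "$_\n" for join_on_first_column(\@file1, \@file2);
```

With these rows, only TRINITY_DN100032_c0_g2_i1 appears in both inputs, so the script prints the two lines shown in the question's desired output, g.20078 first and g.42263 second. For real files, the arrays could be replaced by lines read from filehandles as in the snippets above.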