join 3 files by first Column with join (was awk)?

2020-07-09 07:45发布

问题:

i have three similar files, they are all like this:

File A

ID1 Value1a
ID2 Value2a
  .
  .
  .
IDN Value2n

and i want an output like this

Output

ID1 Value1a Value1b Value1c
ID2 Value2a Value2b Value2c
.....
IDN ValueNa ValueNb ValueNc

Looking to the first line, i want value1A to be the value of id1 in fileA, value1B the value of id1 in fileB, and so on which each field and each line. I thougth it like a sql join. I've tried several things but none of them where even close.

EDIT: All files have the same length and ids.

回答1:

Give join(1) a try:

join fileA fileB | join - fileC


回答2:

join (Dennis's answer) is better, but just for kicks, here's what I came up with in awk:

awk '{a=$0; getline b <"fileB"; getline c <"fileC"; $0=a" "b" "c; print $1,$2,$4,$6}' <fileA


回答3:

Update: The question has been edited to state that all files contain all keys, so the accepted answer (join) is definitely better than this one. Only consider using this one if it's possible the keys may not be in all files.


If you're not too concerned about performance, you could try the quick and dirty:

$ cat file_a
    ID5 Value5a
    ID1 Value1a
    ID3 Value3a
    ID4 Value4a
    ID2 Value2a
$ cat file_b
    ID1 Value1b
    ID3 Value3b
$ cat file_c
    ID2 Value2c
    ID3 Value3c
    ID4 Value4c
    ID5 Value5c
$ cat qq.sh
    #!/bin/bash
    keylist=$(awk '{print $1'} file_[abc] | sort | uniq)
    for key in ${keylist} ; do
        val_a=$(grep "^${key} " file_a | awk '{print $2}') ; val_a=${val_a:--}
        val_b=$(grep "^${key} " file_b | awk '{print $2}') ; val_b=${val_b:--}
        val_c=$(grep "^${key} " file_c | awk '{print $2}') ; val_c=${val_c:--}
        echo ${key} ${val_a} ${val_b} ${val_c}
    done
$ ./qq.sh
    ID1 Value1a Value1b -
    ID2 Value2a - Value2c
    ID3 Value3a Value3b Value3c
    ID4 Value4a - Value4c
    ID5 Value5a - Value5c

This actually works out the keys first then gets the values from each file with that key, or - if it's not in the relevant file.

The grep commands will need to be adjusted if the file is more complex (either if field 1 isn't at the start of the line or is followed by a non-space separator) but this should be a reasonable first-cut solution. The likely grep to use in that case would be:

grep "^[ X]*${key}[ X]"

where X is actually the tab character, as this allows for zero-or-more spaces or tabs before the key and a space or tab to terminate the key.

If the files are particularly large, you may want to look into using the associative arrays within awk but, since there's no indication of the size, I'd start with this one until you get to the point where it's running too slow.



回答4:

Just to add that in order for join to work the input should be sorted. This awk solution should handle any number of input files. You will also loose the original order of the keys (you'll need more code to preserve it).

awk 'END {
  for (K in k) print K, k[K]
    }
{ 
  k[$1] = k[$1] ? k[$1] FS $2 : $2 
  }' file1 file2 [.. filen]