awk + bash: combining an arbitrary number of files

Posted 2019-09-12 04:11

Question:

I have a script that takes a number of data files with identical layout but different data and combines a specified data column into a new file, like this:

gawk '{
        names[$1]= 1;
        data[$1,ARGIND]= $2
} END {
        for (i in names) print i"\t"data[i,1]"\t"data[i,2]"\t"data[i,3]
}' $1 $2 $3 > combined_data.txt

... where the row IDs can be found in the first column, and the interesting data in the second column.

This works nicely, but not for an arbitrary number of files. I could add $4 $5 ... $n to the last line, up to whatever maximum number of files I think I need, and add the matching "\t"data[i,4]"\t"data[i,5] ... "\t"data[i,n] to the line above it. That does seem to work even when there are fewer than n input files (awk seems to just print empty fields for the missing ones), but it feels like an "ugly" solution. Is there a way to make this script (or something that gives the same result) take an arbitrary number of input files?

Or, even better, can you somehow incorporate a find in there that searches through subfolders and finds files matching some criterion?

Here is some sample data:

file.1:

A      554
B       13
C      634
D       84
E        9

file.2:

C      TRUE
E      TRUE
F      FALSE

expected output:

A      554
B       13
C      634       TRUE
D       84
E        9       TRUE
F                FALSE

Answer 1:

This may be what you're looking for (uses GNU awk for ARGIND just like your original script):

$ cat tst.awk
BEGIN { OFS="\t" }
!seen[$1]++ { keys[++numKeys]=$1 }
{ vals[$1,ARGIND]=$2 }
END {
    for (rowNr=1; rowNr<=numKeys; rowNr++) {
        key = keys[rowNr]
        printf "%s%s", key, OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}

$ awk -f tst.awk file1 file2
A       554
B       13
C       634     TRUE
D       84
E       9       TRUE
F               FALSE

If you don't care about the order the rows are output in then all you need is:

BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
    for (key in keys) {
        printf "%s%s", key, OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}
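
For the find part of the question, one option is to let find produce the file list and hand it to a single awk call. The following is only a sketch under a few assumptions: the data/ directory and the *.dat pattern are made up for illustration, -print0, sort -z and xargs -0 are common extensions rather than POSIX, and the file list must fit on one command line (otherwise xargs would start awk more than once). It counts files with FNR==1 instead of ARGIND so that it also runs on non-GNU awks; the trade-off is that an empty file gets no column of its own.

```shell
#!/bin/sh
# Hypothetical layout: data files named *.dat somewhere below ./data
mkdir -p data/run1 data/run2
printf 'A 554\nB 13\nC 634\nD 84\nE 9\n' > data/run1/a.dat
printf 'C TRUE\nE TRUE\nF FALSE\n'       > data/run2/b.dat

# Sort the NUL-separated file list so the column order is reproducible,
# then pass all files to one awk invocation.
find data -type f -name '*.dat' -print0 | sort -z | xargs -0 awk '
BEGIN { OFS = "\t" }
FNR == 1 { fileNr++ }                   # portable stand-in for ARGIND
!seen[$1]++ { keys[++numKeys] = $1 }    # remember first-seen row order
{ vals[$1, fileNr] = $2 }
END {
    for (rowNr = 1; rowNr <= numKeys; rowNr++) {
        key = keys[rowNr]
        printf "%s%s", key, OFS
        for (colNr = 1; colNr <= fileNr; colNr++)
            printf "%s%s", vals[key, colNr], (colNr < fileNr ? OFS : ORS)
    }
}' > combined_data.txt
```

The columns appear in the sorted path order, so with this layout data/run1/a.dat supplies the first data column and data/run2/b.dat the second.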


Answer 2:

You can read an arbitrary number of files with a redirected getline over the ARGV list. Doing all the work in BEGIN and then calling exit bypasses awk's default file processing:

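awk 'BEGIN {
  for (i = 1; i < ARGC; ++i) {          # the last file is ARGV[ARGC-1], not ARGV[ARGC]
    while ((getline < ARGV[i]) > 0) {   # > 0 stops at EOF (0) and on read errors (-1)
      ...
    }
    close(ARGV[i])
  }
  <END-type code>
  exit}' $(find -type f ...)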
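
To make the skeleton concrete, here is one way the placeholders might be filled in for the combining task from the question. This is a sketch only: the file names part1.txt and part2.txt and all variable names are invented for illustration, and each line is split by hand because getline into a variable does not touch $1 and $2.

```shell
#!/bin/sh
# Two small hypothetical input files
printf 'A 554\nB 13\nC 634\n' > part1.txt
printf 'C TRUE\nF FALSE\n'    > part2.txt

awk 'BEGIN {
    OFS = "\t"
    nFiles = ARGC - 1
    for (i = 1; i <= nFiles; i++) {
        # getline returns 1 per line read, 0 at EOF, -1 on error
        while ((getline line < ARGV[i]) > 0) {
            split(line, f, " ")              # " " means default whitespace splitting
            if (!(f[1] in seen)) { seen[f[1]]; keys[++numKeys] = f[1] }
            vals[f[1], i] = f[2]
        }
        close(ARGV[i])
    }
    # END-type code: one output row per key, one column per file
    for (r = 1; r <= numKeys; r++) {
        row = keys[r]
        for (c = 1; c <= nFiles; c++) row = row OFS vals[keys[r], c]
        print row
    }
    exit
}' part1.txt part2.txt > getline_combined.txt
```

Because the loop index i already says which file is being read, this variant needs neither ARGIND nor an FNR-based counter.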


Answer 3:

Assuming the input files are named 1, 2, ... :

   gawk '{ 
        names[$1]=$1
        data[$1,ARGIND]=$2
      } 
      END {
        for (i in names) {
           printf("%s\t",i)
           for (x=1;x<=ARGIND;x++) {
             printf("%s\t", data[i,x])
             }
           print ""
           }
       }' [0-9]* > combined_data.txt

Results:

A   554 
B   13  
C   634 TRUE
D   84  
E   9   TRUE
F       FALSE
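
One thing to watch with the [0-9]* glob: the shell expands it in lexical order, so a file named 10 lands before 2 and the columns follow that order. If numeric column order matters, the names can be sorted before they reach awk. A rough sketch, assuming purely numeric file names without whitespace; the numbered/ directory and the two output file names are made up for the demonstration, and FNR==1 stands in for ARGIND so it runs on any awk:

```shell
#!/bin/sh
# Hypothetical demo directory with files named 1, 2 and 10
mkdir -p numbered
for n in 1 2 10; do printf 'A v%s\n' "$n" > "numbered/$n"; done

(
    cd numbered || exit 1

    # Plain glob order is lexical: 1 10 2
    echo [0-9]* > ../glob_order.txt

    # Numeric order: sort the names first, then pass them to awk
    printf '%s\n' [0-9]* | sort -n | xargs awk '
    FNR == 1 { fileNr++ }
    { data[$1, fileNr] = $2; names[$1] }
    END {
        for (i in names) {
            line = i
            for (x = 1; x <= fileNr; x++) line = line "\t" data[i, x]
            print line
        }
    }' > ../numeric_order.txt
)
```

Building the whole line before printing also avoids the trailing tab that the printf loop above leaves at the end of each row.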


Answer 4:

Another solution using join, bash, awk and tr. It requires file1, file2, file3, etc. to be sorted on the first column, because join expects sorted input.

multijoin.sh

#!/bin/bash
function __t { 
  join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" | 
  awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}'; 
}
CMD="cat '$1'"
for i in `seq 2 $#`; do
  CMD="$CMD | __t '${@:$i:1}'";
done
eval "$CMD | tr '_' '\t' | tr ' ' '\t'";

Or, a recursive version:

#!/bin/bash
function __t { 
  join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" | 
  awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}'; 
}
function __r { 
  if [[ "$#" -gt 1 ]]; then
    __t "$1" | __r "${@:2}"; 
  else
    __t "$1"; 
  fi
}
__r "${@:2}" < "$1" | tr '_' '\t' | tr ' ' '\t'

NOTE: the data cannot contain the character _, because the script uses it as a temporary separator between columns.

You get:

./multijoin.sh file1 file2
A   554
B   13
C   634 TRUE
D   84
E   9   TRUE
F       FALSE

For example, if file3 contains

A    111
D    222
E    333

then running

./multijoin.sh file1 file2 file3

gives:

A   554       111
B   13      
C   634 TRUE    
D   84        222
E   9   TRUE  333
F       FALSE   


Tags: bash awk gawk