I have a script that takes a number of data files with identical layout but different data and combines a specified data column into a new file, like this:
gawk '{
names[$1]= 1;
data[$1,ARGIND]= $2
} END {
for (i in names) print i"\t"data[i,1]"\t"data[i,2]"\t"data[i,3]
}' $1 $2 $3 > combined_data.txt
... where the row IDs can be found in the first column, and the interesting data in the second column.
This works nicely, but not for an arbitrary number of files. I could simply add $4 $5 ... $n in the last line, up to whatever maximum number of files I think I need, and add the matching "\t"data[i,4]"\t"data[i,5] ... "\t"data[i,n] in the line above it. That does seem to work even when fewer than n files are given (awk seems to ignore that n is larger than the number of input files in those cases), but it feels like an "ugly" solution. Is there a way to make this script (or something that gives the same result) take an arbitrary number of input files?
Or, even better, can you somehow incorporate a find in there that searches through subfolders and finds files matching some criterion?
Here is some sample data:
file.1:
A 554
B 13
C 634
D 84
E 9
file.2:
C TRUE
E TRUE
F FALSE
expected output:
A 554
B 13
C 634 TRUE
D 84
E 9 TRUE
F FALSE
This may be what you're looking for (uses GNU awk for ARGIND just like your original script):
$ cat tst.awk
BEGIN { OFS="\t" }
!seen[$1]++ { keys[++numKeys]=$1 }
{ vals[$1,ARGIND]=$2 }
END {
for (rowNr=1; rowNr<=numKeys; rowNr++) {
key = keys[rowNr]
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
}
}
$ awk -f tst.awk file1 file2
A 554
B 13
C 634 TRUE
D 84
E 9 TRUE
F FALSE
If you don't care about the order the rows are output in then all you need is:
BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
}
}
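ARGIND is a gawk extension. Where only a POSIX awk is available, one possible sketch is to count files by watching FNR reset to 1. The function name combine_columns is hypothetical, and the approach assumes no input file is empty, since an empty file never triggers FNR == 1 and so would not get a column.

```shell
# combine_columns FILE...: merge column 2 of every FILE, keyed on column 1.
# Hypothetical helper name; relies only on POSIX awk features (no ARGIND).
combine_columns() {
    awk '
    BEGIN { OFS = "\t" }
    FNR == 1 { fileNr++ }                  # first line of a file: a new file has started
    !seen[$1]++ { keys[++numKeys] = $1 }   # remember keys in first-seen order
    { vals[$1, fileNr] = $2 }
    END {
        for (rowNr = 1; rowNr <= numKeys; rowNr++) {
            key = keys[rowNr]
            printf "%s", key
            # one column per input file; missing values print as empty strings
            for (colNr = 1; colNr <= fileNr; colNr++)
                printf "%s%s", OFS, vals[key, colNr]
            print ""
        }
    }' "$@"
}
```

It would be called as combine_columns file1 file2 > combined_data.txt, with as many file arguments as needed.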
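You can process an arbitrary number of files by running a redirected getline over the ARGV list yourself, bypassing awk's default file processing by doing everything in BEGIN and finishing with exit:
awk 'BEGIN {
    # ARGV[0] is "awk"; the file operands are ARGV[1] .. ARGV[ARGC-1]
    for (i=1; i<ARGC; ++i) {
        # compare with > 0 so an unreadable file (getline returns -1) cannot loop forever
        while ((getline < ARGV[i]) > 0) {
            ...
        }
        close(ARGV[i])
    }
    <END-type code>
    exit
}' $(find -type f ...)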
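An unquoted $(find ...) is split on whitespace, so file names containing spaces would break it. A rough sketch that lets find hand the files to awk directly is shown here. The function name combine_found, its two parameters, and the FNR-based file counter used in place of ARGIND are assumptions for illustration. find returns files in no guaranteed order, so the column order may vary, and for very long file lists find may run awk more than once.

```shell
# combine_found DIR PATTERN: hypothetical wrapper that merges column 2 of every
# regular file under DIR whose name matches PATTERN, keyed on column 1.
combine_found() {
    find "$1" -type f -name "$2" -exec awk '
        BEGIN { OFS = "\t" }
        FNR == 1 { fileNr++ }                # a new input file has started
        { vals[$1, fileNr] = $2; keys[$1] }  # referencing keys[$1] creates the entry
        END {
            for (key in keys) {              # row order is unspecified
                printf "%s", key
                for (colNr = 1; colNr <= fileNr; colNr++)
                    printf "%s%s", OFS, vals[key, colNr]
                print ""
            }
        }' {} +
}
```

For example, combine_found . 'data*.txt' | sort > combined_data.txt would search the current directory and its subfolders, with sort added only to make the row order stable.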
Supposing the input files are named 1, 2, ... (so that the glob [0-9]* matches them):
gawk '{
names[$1]=$1
data[$1,ARGIND]=$2
}
END {
for (i in names) {
# print the key, then one tab-prefixed value per file (avoids a trailing tab)
printf("%s",i)
for (x=1;x<=ARGIND;x++) {
printf("\t%s", data[i,x])
}
print ""
}
}' [0-9]* > combined_data.txt
Results:
A 554
B 13
C 634 TRUE
D 84
E 9 TRUE
F FALSE
Another solution, using join, bash, awk and tr, which works if file1, file2, file3, etc. are sorted on the first column.
multijoin.sh:
#!/bin/bash
function __t {
join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" |
awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}';
}
CMD="cat '$1'"
for i in `seq 2 $#`; do
CMD="$CMD | __t '${@:$i:1}'";
done
eval "$CMD | tr '_' '\t' | tr ' ' '\t'";
Or, a recursive version:
#!/bin/bash
function __t {
join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" |
awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}';
}
function __r {
if [[ "$#" -gt 1 ]]; then
__t "$1" | __r "${@:2}";
else
__t "$1";
fi
}
__r "${@:2}" < "$1" | tr '_' '\t' | tr ' ' '\t'
NOTE: the data cannot contain the character _, because the script uses it as a temporary separator between columns.
Running
./multijoin.sh file1 file2
gives:
A 554
B 13
C 634 TRUE
D 84
E 9 TRUE
F FALSE
For example, if file3 contains
A 111
D 222
E 333
then
./multijoin.sh file1 file2 file3
gives:
A 554 111
B 13
C 634 TRUE
D 84 222
E 9 TRUE 333
F FALSE