Combining certain columns of several tab-delimited

2019-05-27 09:48发布

问题:

1st column in inFile contains a string not necessarily present in all inFiles

2nd and 7th columns in each inFile contains the Title# strings

Using AWK, I cannot piece this together correctly. My use of descriptive variables will hopefully help clarify what I'm trying to do. These are components I think I need:

  1. tab-separated input files: -F'\t'
  2. increment the strings in the 1st column, but only add each 'name' once to the '1stColumnNames': !1stColumnNames[$1]++ { name[++i] = $1 }
  3. make a new index for each .tsv file to store values for each file to avoid overwriting each column's values: !r[FILENAME]++ { ++argind }
  4. store corresponding column values in 2nd and 7th columns for each file: { 2ndColumnVals[$1, argind] = $2 } { 7thColumnVals[$1, argind] = $7 }
  5. print all 1stColumnNames with associated 2ndColumnVals and 7thColumnVals, including their headers 'Title1' 'Title2' 'Title3' etc. : ?????
  6. index values that were empty for a particular 2ndColumnVals or 7thColumnVals, print as Mtee: ?????
  7. do this for all .tsv files in the current working directory and ouput a new tsv file: *.tsv > outFile.tsv

Example Files

inFile1.tsv

Names   Title1  Title2
AAAA    1111    123456
BBBBB   1111    123456
CCC     1111    123456

inFile2.tsv

Names   Title3  Title4
BBBBB   2222    789456
DDDDD   2222    789456
EEEE    2222    789456

inFile3.tsv

Names   Title5  Title6
AAAA    3333    987654
CCC     3333    987654
EEEE    3333    987654

outFile123.tsv

Names   Title1  Title2  Title3  Title4  Title5  Title6
AAAA    1111    123456  Mtee    Mtee    3333    987654  
BBBBB   1111    123456  2222    789456  Mtee    Mtee
CCC     1111    123456  Mtee    Mtee    3333    987654
DDDDD   Mtee    Mtee    2222    789456  Mtee    Mtee
EEEE    Mtee    Mtee    2222    789456  3333    987654




Test Script

GNU Awk 4.0.1 is located in /usr/bin/awk , so I made this file and executed it in the same working directory where the 3 input files are located:

named as script1.sh
#### Example Usage:  script1.sh inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv

awk -F'\t' '
FNR==1 { ++numFiles}
!seen[$1]++ { keys[++numKeys] = $1 }
{ a[$1,numFiles] = $2 FS $3 }
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s", key
        for (fileNr=1;fileNr<=numFiles;fileNr++) {
            printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "Mtee\tMtee")
        }
        print ""
    }
}
' "$@"

Running awk -F script1.awk inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv prints the follow error messages:

awk: cmd. line:1: inFile1.tsv

awk: cmd. line:1: ^ syntax error




Test Script 2 from konsolebox

works perfectly but I'm trying to understand each line by commenting:

#!/usr/bin/awk -f
#### named as script2.awk
#### Example Usage:  awk -f script2.awk inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv

BEGIN { FS = "\t" } #input File Style is tab-delimited
{ sub(/\r/, "") }   #remove all carriage return characters
!f[FILENAME]++ { ++indx }   #for all files inputted make a single index called indx
!a[$1]++ { keys[i++] = $1 } #the new indx comprises only unique strings in column 1
{ b[$1, indx] = $2 FS $3 }  #the 2nd and 3rd column are tab delimited and each pair that corresponds to a string saved in keys gets stored after the 1st column string in matrix b
END {
    for (i = 0; i in keys; ++i) {   #????
        key = keys[i]   #????
        printf "%s", keys   #prints out all strings in the index column 1 stored as keys
        for (j = 1; j <= indx; ++j) {   #????
            v = b[key, j]   #????
            printf "\t%s", length(v) ? v : "Mtee" FS "Mtee" #print out strings as tab delimited and replace any lengths of 1 char to two Mtee separated by a tab
        }
        print ""    #????
    }
}

回答1:

And here's another awk:

#!/usr/bin/awk -f
# Set field separator to tab (\t)
BEGIN { FS = "\t" }
# Remove carriage return characters if file is in DOS format (CRLF)
{ sub(/\r/, "") }
# Increment indx by 1 (starts at 0) everytime a new file is processed
!f[FILENAME]++ { ++indx }
# Add a key ($1) to keys array every time it is first encountered
!a[$1]++ { keys[i++] = $1 }
# Store the 2nd and 3rd field to b matrix
{ b[$1, indx] = $2 FS $3 }
# This block runs after all files are processed
END {
    # Traverse the keys in order
    for (i = 0; i in keys; ++i) {
        key = keys[i]
        # Print key
        printf "%s", key
        # Print columns from every file in order
        for (j = 1; j <= indx; ++j) {
            v = b[key, j]
            printf "\t%s", length(v) ? v : "Mtee" FS "Mtee"
        }
        # End the line with a newline
        print ""
    }
}

Usage:

awk -f script.awk file1 file2 file3

Output:

Names   Title1  Title2  Title3  Title4  Title5  Title6
AAAA    1111    123456  Mtee    Mtee    3333    987654
BBBBB   1111    123456  2222    789456  Mtee    Mtee
CCC     1111    123456  Mtee    Mtee    3333    987654
DDDDD   Mtee    Mtee    2222    789456  Mtee    Mtee
EEEE    Mtee    Mtee    2222    789456  3333    987654


回答2:

You need something like this:

Gawk version (for ARGIND plus true 2D arrays in gawk 4.0+):

$ gawk -F'\t' '
!seen[$1]++ { keys[++numKeys] = $1 }
{ a[$1][ARGIND] = $2 FS $3 }
END {
    for (keyNr = 1; keyNr <= numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s", key
        for (fileNr = 1; fileNr <= ARGIND; fileNr++) {
            printf "\t%s", (fileNr in a[key] ? a[key][fileNr] : "Mtee\tMtee")
        }
        print ""
    }
}
' file1 file2 file3

Non-gawk version:

awk -F'\t' '
FNR==1 { ++numFiles}
!seen[$1]++ { keys[++numKeys] = $1 }
{ a[$1,numFiles] = $2 FS $3 }
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s", key
        for (fileNr=1;fileNr<=numFiles;fileNr++) {
            printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "Mtee\tMtee")
        }
        print ""
    }
}
' file1 file2 file3
Names   Title1  Title2  Title3  Title4  Title5  Title6
AAAA    1111    123456  Mtee    Mtee    3333    987654
BBBBB   1111    123456  2222    789456  Mtee    Mtee
CCC     1111    123456  Mtee    Mtee    3333    987654
DDDDD   Mtee    Mtee    2222    789456  Mtee    Mtee
EEEE    Mtee    Mtee    2222    789456  3333    987654


标签: awk merge tsv