UNIX Shell Script Solution for formatting a pipe-delimited file

Posted 2019-08-08 14:52

Question:

The input file has up to 34 different record types within the same line.

The file is pipe-delimited, and each record type is separated by '~' (except for the originating record type).

Not all 34 record types are contained on each line, and I do not need all of them.

All record types will be sent in a specified order, but not all record types will always be sent. The first record type is mandatory and will always be sent. Out of the 34 types, there are only 7 that are mandatory.

Each record type has a predefined number of fields, and the data should never deviate from that definition without proper lead time between the client and our load process.

The Oracle table will be constructed with all of the required columns based upon the needed record types. So one row will contain information from each record type similar to the input file, but will additionally include nulls for the columns which would come from certain record types that were not included in the input.

The end result I'm looking for is a way to perform conditional formatting to the input file in order to generate an output that can be simply loaded within a shell script via sqlldr instead of going through PL/SQL (as I want my non-PL/SQL coworkers to be able to troubleshoot/fix any issues encountered during loads).
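For the load step itself, a minimal sketch of what the sqlldr side might look like, assuming the '~' markers are stripped first; the file names, the loader_target table, the *_tag/*_f* column names, the field counts, and the scott/tiger credentials are all placeholders for illustration:

```shell
#!/bin/sh
# Tiny stand-in for the reformatted output (real data would come from the
# conditional-formatting step).
printf 'AA|12345|ABCDE|~BB|999|\n' > formatted_output.dat

# Strip the '~' record separators so each line is purely pipe-delimited.
tr -d '~' < formatted_output.dat > load_input.dat

# Hypothetical control file: the record-type tags (AA, BB, ...) go into
# FILLER fields so they never reach the table; table and column names are
# placeholders and must match the real Oracle table definition.
cat > loader.ctl <<'EOF'
LOAD DATA
INFILE 'load_input.dat'
APPEND INTO TABLE loader_target
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
( aa_tag FILLER, aa_f1, aa_f2,
  bb_tag FILLER, bb_f1 )
EOF

# The load itself (credentials are placeholders):
# sqlldr userid=scott/tiger control=loader.ctl log=loader.log
```

TRAILING NULLCOLS is what lets short lines load with NULLs in the trailing columns, which pairs with the empty-field padding described below.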

Small example with 3 records (data types do not matter in this example):

Record Types:  AA, BB, CC, DD, EE, FF, GG  
AA has 5 fields (Mandatory)  
BB has 2 fields (Optional)  
CC has 3 fields (Optional)  
DD has 6 fields (Optional)  
EE has 4 fields (Optional)  
FF has 2 fields (Not needed.  Skipping in output)  
GG has 4 fields (Optional)


AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~FF|P|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~CC|BCDEF|23456|~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~DD|B|C|D||~FF|Q

Line 1 has no issues because it has all available record types, but lines 2 and 3 do not. So they would need to be modified to include the missing record types. The overall output would need to look something like this:

AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG|||

I have started by taking each record, splitting it into its own file, and using:

typeset -i count=0
while read -r record
do
    newfile="${file}.${count}.dat"                # no need for an `echo` subshell
    echo "$record" | sed 's/|~/\n/g' > "$newfile" # GNU sed; POSIX sed needs a literal newline
    count=$((count + 1))
done < "$file"

to put each record type on its own line within said file, but rolling it back up into one line with all possible fields present is quite tricky. This is obviously not the best way since each file can have several thousand records, which would result in several thousand files, but I was using that as a starting point to get the logic down.
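As an aside, the same split can be done in one pass without creating per-record files; a minimal sketch (sample lines inlined for illustration):

```shell
# Put each record type of every line on its own line, with a blank line
# between records, instead of writing thousands of temporary files.
printf '%s\n' \
  'AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345' \
  'AA|23456|BCDEF|78901|GHIJK|~CC|BCDEF|23456' |
awk '{ gsub(/\|~/, "\n"); print; print "" }'
```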

Any thoughts?

Answer 1:

Here's an executable awk script solution that isn't completely rigorous, but could get you started:

#!/usr/bin/awk -f

BEGIN { FS=OFS="~" }

FNR==NR {
    dflts[$1] = create_empty_field($1,$2)
    if( $3 ~ /req|opt/ ) fld_order[++fld_cnt] = $1
    fld_rule[$1] = $3
    next
}

{
    flds = ""
    j = 1
    for(i=1; i<=fld_cnt; i++) {
        j = skip_flds( j )

        if($j !~ ("^" fld_order[i])) fld = dflts[fld_order[i]]
        else { fld = $j; j++ }
        flds = flds (flds=="" ? "" : OFS) fld
    }
    print flds
}

function create_empty_field(name, cnt,     fld, i) {
    fld = name
    for(i=1; i<=cnt; i++) { fld = fld "|" }
    return( fld )
}

function skip_flds(fnum,     name) {
    name = $fnum
    sub(/\|.*$/, "", name)
    while(fld_rule[name] == "skp") {
        fnum++
        name = $fnum
        sub(/\|.*$/, "", name)
    }
    return( fnum )
}

It takes an additional input file that specifies the defaults for each record type, which I've called "known_flds":

AA~5~req
BB~2~opt
CC~3~opt
DD~6~opt
EE~4~opt
FF~2~skp
GG~4~opt

which has the same delimiter as the data file because I didn't want to add FS switching in either the script or between the input files. It's an encoding of your field requirements. The final field is shorthand for:

  • req -> Mandatory ( in input or output or both? )
  • opt -> Optional ( only optional in input )
  • skp -> Skip ( in output )

When awk.script is made executable and run like ./awk.script known_flds data, I get the following output:

AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG||||

The GG record in the question's data doesn't appear to have the right number of fields specified, or is missing a trailing pipe in the input data.

I made at least the following assumptions:

  • Each field in the file is correct - the fields themselves don't need padding
  • The fields are in the correct order, including fields that should be skipped.
  • Any line might be missing the optional fields, and any missing, optional field should appear as an empty field in the output.
  • The field order can be designated from the known_flds file. Otherwise, I might have required the first line of the file to be complete and in correct field order, as well as contain all fields required for the output. That wouldn't allow fields to be marked mandatory, though.

Here's a simple breakdown of the script:

  • FNR==NR - parse the known_flds file and create default empty fields using the create_empty_field() function, putting the results in dflts by field name. Create a basic field order, store it in the fld_order array. Skipped fields are not put into fld_order, but all field "rules" are added to the fld_rule array.
  • Every data line is then processed. Fields are checked in order, and only fld_cnt fields are printed for any record. Any fields past the count in known_flds won't be output.
  • For any record, skip over skp fields by incrementing j.
  • Build a flds variable with either the current field by $j or if it appears to be missing a field, with an empty field from dflts.
  • Print out flds with the additional, empty fields but without skipped fields.

Here's a breakdown of the functions

create_empty_field():

  • name, cnt are arguments from the first file, while fld and i are local variables set to empty values for use within the function.
  • set fld to name ( $1 from known_flds )
  • Generate pipes up to cnt value ( $2 from known_flds ).
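To see what create_empty_field() produces on its own, a quick standalone check (function body copied verbatim from the script above):

```shell
# Prints the empty placeholder generated for record type DD (6 fields):
# the type tag followed by 6 pipes, matching the padding in the output.
awk 'function create_empty_field(name, cnt,     fld, i) {
         fld = name
         for(i=1; i<=cnt; i++) { fld = fld "|" }
         return( fld )
     }
     BEGIN { print create_empty_field("DD", 6) }'
```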

skip_flds():

  • fnum is the argument for the record field number, while name is a local variable
  • Pull the name part from $fnum
  • Check to see if it should be skipped with fld_rule[name] == "skp" test.
  • If it should be skipped, increment fnum and reset the name variable.
  • I think the repeated name = / sub() call pair should really be its own function, but I didn't do that here.
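The name-extraction and skip test in isolation, for reference; given a ~-separated record, this shows how an FF field is recognized and stepped over (the rule table is hard-coded just for this demo):

```shell
printf 'AA|1|~FF|P|~GG|x|y\n' |
awk -F'~' '{
    fld_rule["FF"] = "skp"            # only rule needed for this demo
    for (j = 1; j <= NF; j++) {
        name = $j
        sub(/\|.*$/, "", name)        # keep just the record-type tag
        if (fld_rule[name] == "skp") continue
        print name
    }
}'
```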

Basically, I'm making parsing/transformation rules in known_flds and then interpreting/enforcing them with awk.script against records in a data file. While this is a reasonable start, you could additionally print errors to another file when mandatory fields are not present or would be empty, add missing subfields to fields, etc. You could get as complicated as you like.