The input file has up to 34 different record types within the same line.
The file is pipe-delimited, and each record type is separated by '~' (except for the originating record type, which starts the line).
Not all 34 record types are contained on each line, and I do not need all of them.
All record types will be sent in a specified order, but not all record types will always be sent. The first record type is mandatory and will always be sent; out of the 34 types, only 7 are mandatory.
Each record type has a predefined number of fields, which should never deviate from that definition without proper lead time between the client and our load process.
The Oracle table will be constructed with all of the required columns based upon the needed record types. So one row will contain information from each record type, similar to the input file, but will additionally include nulls for the columns belonging to record types that were not included in the input.
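For illustration, using the example record types introduced further down (table name, column names, and VARCHAR2 sizes are placeholders), such a table might look like:

CREATE TABLE client_load (
  aa_f1 VARCHAR2(20), aa_f2 VARCHAR2(20), aa_f3 VARCHAR2(20), aa_f4 VARCHAR2(20),  -- AA
  bb_f1 VARCHAR2(20),                                                              -- BB
  cc_f1 VARCHAR2(20), cc_f2 VARCHAR2(20),                                          -- CC
  dd_f1 VARCHAR2(20), dd_f2 VARCHAR2(20), dd_f3 VARCHAR2(20),
  dd_f4 VARCHAR2(20), dd_f5 VARCHAR2(20),                                          -- DD
  ee_f1 VARCHAR2(20), ee_f2 VARCHAR2(20), ee_f3 VARCHAR2(20),                      -- EE
  gg_f1 VARCHAR2(20), gg_f2 VARCHAR2(20), gg_f3 VARCHAR2(20)                       -- GG
);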
The end result I'm looking for is a way to perform conditional formatting on the input file in order to generate an output that can simply be loaded from a shell script via sqlldr, instead of going through PL/SQL (I want my non-PL/SQL coworkers to be able to troubleshoot and fix any issues encountered during loads).
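For illustration only, a sqlldr control file for the reformatted output might look like the sketch below: it maps each record-type marker (AA, ~BB, and so on) to a FILLER field so it is read but not loaded. Table and column names are placeholders, and it assumes every record type carries its declared number of fields on every line.

LOAD DATA
INFILE 'converted.dat'
APPEND
INTO TABLE client_load
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
(
  aa_marker FILLER,                    -- literal "AA"
  aa_f1, aa_f2, aa_f3, aa_f4,
  bb_marker FILLER,                    -- literal "~BB"
  bb_f1,
  cc_marker FILLER,                    -- literal "~CC"
  cc_f1, cc_f2,
  dd_marker FILLER,                    -- literal "~DD"
  dd_f1, dd_f2, dd_f3, dd_f4, dd_f5,
  ee_marker FILLER,                    -- literal "~EE"
  ee_f1, ee_f2, ee_f3,
  gg_marker FILLER,                    -- literal "~GG"
  gg_f1, gg_f2, gg_f3
)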
Small example with 3 records (data types do not matter in this example):
Record Types: AA, BB, CC, DD, EE, FF, GG
AA has 5 fields (Mandatory)
BB has 2 fields (Optional)
CC has 3 fields (Optional)
DD has 6 fields (Optional)
EE has 4 fields (Optional)
FF has 2 fields (Not needed. Skipping in output)
GG has 4 fields (Optional)
AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~FF|P|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~CC|BCDEF|23456|~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~DD|B|C|D||~FF|Q
Line 1 has no issues because it has all available record types (it only needs FF dropped), but lines 2 and 3 do not, so they would need to be modified to include the missing record types. The overall output would need to look something like this:
AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG|||
I have started by taking each record, splitting it into its own file, and using:
typeset -i count=0
while read record
do
    # one output file per input record
    newfile="$file.$count.dat"
    # break the record apart at each '|~' boundary,
    # one record type per line
    echo "$record" | sed 's/|~/\n/g' > "$newfile"
    count=$count+1
done < "$file"
to put each record type on its own line within said file, but rolling it back up into one line with all possible fields present is quite tricky. This is obviously not the best way since each file can have several thousand records, which would result in several thousand files, but I was using that as a starting point to get the logic down.
Any thoughts?
Here's an executable awk script solution that isn't completely rigorous, but could get you started.
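What follows is a minimal sketch in that spirit: it assumes a known_flds layout of name|count|rule (with rule being man, opt, or skp) and that the mandatory first record type always leads each line. The breakdown below walks through the same pieces.

#!/usr/bin/awk -f
# Usage: ./awk.script known_flds data
# known_flds lines look like: name|field_count|rule  (rule: man, opt, or skp)

BEGIN { FS = "|" }

# First file (known_flds): store each type's rule, declared field count,
# a pre-built empty segment, and the output order (skipped types omitted).
FNR == NR {
    fld_rule[$1] = $3
    fld_cnt[$1]  = $2
    dflts[$1]    = create_empty_field($1, $2)
    if ($3 != "skp")
        fld_order[++order_cnt] = $1
    next
}

# Data file: walk the expected types in order, copying each type's fields
# when present and splicing in the empty default segment when missing.
{
    flds = ""
    j = 1
    for (i = 1; i <= order_cnt; i++) {
        name = fld_order[i]
        j = skip_flds(j)             # hop over any skipped types
        cur = $j
        sub(/^~/, "", cur)
        if (j <= NF && cur == name) {
            # Type present: copy up to fld_cnt[name] fields, stopping early
            # if the next "~" type marker (or end of line) arrives first.
            flds = flds (flds == "" ? "" : "|") $j
            for (k = 1; k < fld_cnt[name] && j < NF && $(j + 1) !~ /^~/; k++) {
                j++
                flds = flds "|" $j
            }
            # Discard any extra fields beyond the declared count.
            while (j < NF && $(j + 1) !~ /^~/)
                j++
            j++
        } else {
            # Type missing: splice in the empty default segment.
            flds = flds "|~" dflts[name]
        }
    }
    print flds
}

# Return "name" followed by cnt - 1 empty pipe-delimited fields.
function create_empty_field(name, cnt,    fld, i) {
    fld = name
    for (i = 1; i < cnt; i++)
        fld = fld "|"
    return fld
}

# Advance fnum past any record types flagged "skp" in known_flds.
function skip_flds(fnum,    name) {
    while (fnum <= NF) {
        name = $fnum
        sub(/^~/, "", name)
        if (fld_rule[name] != "skp")
            break
        fnum++                       # step past the marker field...
        while (fnum <= NF && $fnum !~ /^~/)
            fnum++                   # ...and the skipped type's data fields
    }
    return fnum
}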
It takes an additional input file that specifies the defaults for each record type, which I've called known_flds. It has the same delimiter as the data file because I didn't want to add FS switching in either the script or between the input files. It's an encoding of your field requirements. The final field is shorthand for whether the record type is mandatory, optional, or skipped.
When awk.script is made executable and run like ./awk.script known_flds data, I get the following output (this is what the sketch above produces, and it matches the desired result):
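AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG|||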
The GG record type in the question's data doesn't appear to either have the right number of fields specified, or it is missing a trailing pipe in the input data. I made at least the following assumptions:
Field order and field counts come from the known_flds file. Otherwise, I might have picked the first line of the data file to be complete and in correct field order, as well as containing all fields required for the output. That wouldn't allow fields to be called mandatory, though.

Here's a simple breakdown of the script:
FNR==NR - parse the known_flds file and create default empty fields using the create_empty_field() function, putting the results in dflts by field name. Create a basic field order and store it in the fld_order array. Skipped fields are not put into fld_order, but all field "rules" are added to the fld_rule array.
fld_cnt - only output fld_cnt fields for any record; any fields past the count given in known_flds won't be output.
skip_flds() - hop over "skp" fields, incrementing j past them.
flds - build up the flds variable with either the current field via $j or, if the record appears to be missing a field, with an empty field from dflts.
Finally, print flds with the additional, empty fields but without the skipped fields.

Here's a breakdown of the functions.
create_empty_field(): name, cnt are arguments filled from the first file, while fld, i are local variables set to empty values for use within the function. It sets fld to the record-type name ($1 from known_flds) and appends one empty pipe-delimited field for each remaining position given by the cnt value ($2 from known_flds); for example, create_empty_field("DD", 6) returns DD|||||.

skip_flds(): fnum is the argument for the record field number, while name is a local variable. It extracts the name part from $fnum for the fld_rule[name] == "skp" test, then advances fnum past the skipped type's fields and resets the name variable for the next check. The name = and sub() call lines should really be a new function, but I didn't do that here.

Basically, I'm making parsing/transformation rules in known_flds and then interpreting/enforcing them with awk.script against records in a data file. While this is a reasonable start, you could additionally print errors to another file when mandatory fields are not present or would be empty, add missing subfields to fields, etc. You could get as complicated as you like.
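For instance, a minimal sketch of such an error report, assuming an errors.txt output file, extends the "type missing" branch of the main block above:

# inside the "type missing" else-branch of the main block:
if (fld_rule[name] == "man")
    printf "line %d: mandatory record type %s is missing\n", FNR, name > "errors.txt"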