How do you parse a CSV file using gawk? Simply setting FS=","
is not enough, as a quoted field with a comma inside will be treated as multiple fields.
Example using FS="," which does not work:
file contents:
one,two,"three, four",five
"six, seven",eight,"nine"
gawk script:
BEGIN { FS="," }
{
    for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
    printf "---------------------------\n"
}
bad output:
field #1: one
field #2: two
field #3: "three
field #4: four"
field #5: five
---------------------------
field #1: "six
field #2: seven"
field #3: eight
field #4: "nine"
---------------------------
desired output:
field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------
If Python is an option, I would use its csv module to parse the file, paying special attention to the dialect and formatting parameters your CSV requires.
Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements in Column[] that were found. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.
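The code for this rule is not shown here, so the following is only a minimal sketch of how such a first rule might look, assuming a character-by-character scan that treats commas inside double quotes as data. The wrapper name split_columns and the variables field and in_quotes are my own; only Column[] and ColumnCount come from the description above.

```shell
# Hypothetical wrapper; the name split_columns is an assumption for illustration.
split_columns() {
  awk '
  # First rule: fill Column[] and ColumnCount for every record.
  {
    ColumnCount = 0
    field = ""
    in_quotes = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\"") in_quotes = !in_quotes   # toggle on every double quote
      if (c == "," && !in_quotes) {           # separator outside quotes
        Column[++ColumnCount] = field
        field = ""
      } else {
        field = field c                       # quotes are kept in the field
      }
    }
    Column[++ColumnCount] = field             # last field has no trailing comma
  }
  # Later rules read Column[1..ColumnCount] instead of $1..$NF.
  {
    for (i = 1; i <= ColumnCount; i++) printf "field #%d: %s\n", i, Column[i]
    printf "---------------------------\n"
  }
  ' "$@"
}
```

Called as `split_columns file.csv`, or with input on stdin. Column[] is never cleared between records, which is why stale entries remain past Column[ColumnCount] on shorter rows. Quoted fields that span several lines are not handled by this sketch.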
This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.
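For comparison, here is a rough sketch of the FPAT approach itself. FPAT describes what a field looks like rather than what separates fields; the pattern below (a run of non-commas, or a double-quoted string) is the one commonly shown for simple CSV. The wrapper name parse_with_fpat is hypothetical, and the sketch assumes gawk 4.0.0 or later is installed as gawk.

```shell
# Hypothetical wrapper; requires gawk >= 4.0.0 for FPAT.
parse_with_fpat() {
  gawk '
  # A field is either a run of non-comma characters
  # or a double-quoted string that may contain commas.
  BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
  {
    for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
    printf "---------------------------\n"
  }
  ' "$@"
}
```

With this pattern, empty fields are skipped, and neither doubled quotes inside a quoted field nor embedded newlines are handled, so treat it as a starting point for simple files only.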
Here's what I came up with. Any comments and/or better solutions would be appreciated.
The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.
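The script itself is not included above, so this is only a sketch of the idea as described, not the original code. It splits on commas, then glues pieces back together whenever a field opens with a quote but does not yet close with one. The wrapper name merge_quoted_fields and the out[] array are my own names.

```shell
# Hypothetical wrapper; the name merge_quoted_fields is an assumption.
merge_quoted_fields() {
  awk '
  BEGIN { FS = "," }
  {
    n = 0
    for (i = 1; i <= NF; i++) {
      field = $i
      # A field that opens a quote but does not close it was split at a
      # comma inside the quotes, so append the following pieces to it.
      if (field ~ /^"/) {
        while (field !~ /^".*"$/ && i < NF) field = field FS $(++i)
      }
      out[++n] = field
    }
    for (j = 1; j <= n; j++) printf "field #%d: %s\n", j, out[j]
    printf "---------------------------\n"
  }
  ' "$@"
}
```

This uses nothing gawk-specific, so it should also run under other awk implementations. It can stop merging too early when a quoted field contains doubled quotes followed by a comma, and it does not handle quoted fields that span lines.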