How do you parse a CSV file using gawk? Simply setting FS=","
is not enough, as a quoted field with a comma inside will be treated as multiple fields.
Example using FS=","
which does not work:
file contents:
one,two,"three, four",five
"six, seven",eight,"nine"
gawk script:
BEGIN { FS="," }
{
for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
printf "---------------------------\n"
}
bad output:
field #1: one
field #2: two
field #3: "three
field #4: four"
field #5: five
---------------------------
field #1: "six
field #2: seven"
field #3: eight
field #4: "nine"
---------------------------
desired output:
field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------
The gawk version 4 manual says to use
FPAT = "([^,]*)|(\"[^\"]+\")"
When
FPAT
is defined, it disablesFS
and specifies fields by content instead of by separator.I am not exactly sure whether this is the right way to do things. I would rather work on a csv file in which either all values are to quoted or none. Btw, awk allows regexes to be Field Separators. Check if that is useful.
The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where 'awkward' means things like commas in the CSV field data.
The next question is "What other processing are you going to be doing", since that will influence what alternatives you use.
I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an
awk
andsed
killer - hence thea2p
ands2p
programs still distributed with Perl which convertawk
andsed
scripts (respectively) into Perl.csv2delim.awk
test.csv
test.bat
You can use a simple wrapper function called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out ok:
before:
after:
See https://github.com/dbro/csvquote for code and documentation.
Perl has the Text::CSV_XS module which is purpose-built to handle the quoted-comma weirdness.
Alternately try the Text::CSV module.
perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv
Produces this output:
Here's a human-readable version.
Save it as parsecsv, chmod +x, and run it as "parsecsv file.csv"
You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed on your default version of perl.
If none of your versions of Perl have Text::CSV_XS installed, you'll need to:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS