I am using awk to perform counting the sum of one column in the csv file. The data format is something like:
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
I was using this awk script to count the sum:
awk -F, '{sum+=$3} END {print sum}'
Some of the value in name field contains comma and this break my awk script. My question is: can awk solve this problem? If yes, and how can I do that?
Thank you.
Fully fledged CSV parsers such as Perl's
Text::CSV_XS
are purpose-built to handle that kind of weirdness.perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new({allow_whitespace => 1})} if($csv->parse($_)){@f=$csv->fields();$sum+=$f[2]} END{print $sum}' file
allow_whitespace
is needed since the input data has whitespace surrounding the comma separators. Very old versions ofText::CSV_XS
may not support this option.I provided more explanation of
Text::CSV_XS
within my answer here: parse csv file using gawkyou write a function in awk like below:
you can incorporate in your script this function and check whether the third field is numeric or not.if not numeric then go for the 4th field and if the 4th field inturn is not numberic go for 5th ...till you reach a numeric value.probably a loop will help here, and add it to the sum.
You can help awk work with data fields that contain commas (or newlines) by using a small script I wrote called csvquote. It replaces the offending commas inside quoted fields with nonprinting characters. If you need to, you can later restore those commas - but in this case, you don't need to.
Here is the command:
see https://github.com/dbro/csvquote for the code
I am using
to define the fields with gawk. I found that when the field is null this doesn't recognize correct number of fields. Because "+" requires at least 1 character in the field. I changed it to:
and replace
"+"
with"*"
. It works correctly.I also find that GNU Awk User Guide also has this problem. https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
FPAT is the elegant solution because it can handle the dreaded commas within quotes problem, but to sum a column of numbers in the last column regardless of the number of preceding separators, $NF works well:
awk -F"," '{sum+=$NF} END {print sum}'
To access the second to last column, you would use this:
awk -F"," '{sum+=$(NF-1)} END {print sum}'