Can awk deal with CSV file that contains comma ins

2020-02-26 14:31发布

I am using awk to perform counting the sum of one column in the csv file. The data format is something like:

id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

I was using this awk script to count the sum:

awk -F, '{sum+=$3} END {print sum}'

Some of the value in name field contains comma and this break my awk script. My question is: can awk solve this problem? If yes, and how can I do that?

Thank you.

11条回答
成全新的幸福
2楼-- · 2020-02-26 15:26

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new({allow_whitespace => 1})} if($csv->parse($_)){@f=$csv->fields();$sum+=$f[2]} END{print $sum}' file

allow_whitespace is needed since the input data has whitespace surrounding the comma separators. Very old versions of Text::CSV_XS may not support this option.

I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk

查看更多
smile是对你的礼貌
3楼-- · 2020-02-26 15:28

you write a function in awk like below:

$ awk 'func isnum(x){return(x==x+0)}BEGIN{print isnum("hello"),isnum("-42")}'
0 1

you can incorporate in your script this function and check whether the third field is numeric or not.if not numeric then go for the 4th field and if the 4th field inturn is not numberic go for 5th ...till you reach a numeric value.probably a loop will help here, and add it to the sum.

查看更多
我命由我不由天
4楼-- · 2020-02-26 15:28

You can help awk work with data fields that contain commas (or newlines) by using a small script I wrote called csvquote. It replaces the offending commas inside quoted fields with nonprinting characters. If you need to, you can later restore those commas - but in this case, you don't need to.

Here is the command:

csvquote inputfile.csv | awk -F, '{sum+=$3} END {print sum}'

see https://github.com/dbro/csvquote for the code

查看更多
淡お忘
5楼-- · 2020-02-26 15:31

I am using

`FPAT="([^,]+)|(\"[^\"]+\")" `

to define the fields with gawk. I found that when the field is null this doesn't recognize correct number of fields. Because "+" requires at least 1 character in the field. I changed it to:

`FPAT="([^,]*)|(\"[^\"]*\")"`

and replace "+" with "*". It works correctly.

I also find that GNU Awk User Guide also has this problem. https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

查看更多
Juvenile、少年°
6楼-- · 2020-02-26 15:34

FPAT is the elegant solution because it can handle the dreaded commas within quotes problem, but to sum a column of numbers in the last column regardless of the number of preceding separators, $NF works well:

awk -F"," '{sum+=$NF} END {print sum}'

To access the second to last column, you would use this:

awk -F"," '{sum+=$(NF-1)} END {print sum}'

查看更多
登录 后发表回答