Can awk deal with a CSV file that contains commas inside a quoted field?

Posted 2020-02-26 14:31

I am using awk to compute the sum of one column in a CSV file. The data format is something like:

id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

I was using this awk script to compute the sum:

awk -F, '{sum+=$3} END {print sum}'

Some of the values in the name field contain commas, and this breaks my awk script. My question is: can awk handle this problem? If yes, how can I do that?
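For example, on the quoted row the embedded comma shifts the fields, so $3 is no longer the value:

$ echo '3, "I am the, question", 99' | awk -F, '{ print NF; print $3 }'
4
 question"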

Thank you.

11 answers
We Are One
#2 · 2020-02-26 15:12

This article helped me solve the same data-field issue. Most CSV writers put quotes around fields that contain spaces or commas, and that throws off awk's field count unless you filter them out.

If you need the data inside those quoted fields, this is not for you. ghostdog74 provided the answer, which empties such fields but keeps the total field count, which is key to keeping the data output consistent. I did not like how that solution introduced new lines, so this is the version of it I used. The first three fields never had this problem in my data. The fourth field, containing the customer name, often did, but I needed that data. The remaining fields that exhibited the problem I could throw away without issue because they were not needed in my report output. So I first sed out the 4th field's garbage very specifically and remove the first two instances of quotes. Then I apply what ghostdog74 gave to empty the remaining fields that have commas within them; this also removes the quotes, but I use printf to keep the data in a single record. I start off with 85 fields and end up with 85 fields in all cases across my 8000+ lines of messy data. A perfect score!

grep -i $1 $dbfile | sed 's/\, Inc.//;s/, LLC.//;s/, LLC//;s/, Ltd.//;s/\"//;s/\"//' | awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}' > $tmpfile

The solution that empties the fields containing commas while still maintaining the record is, of course:

awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}

Megs of thanks to ghostdog74 for the great solution!

NetsGuy256/

【Aperson】
#3 · 2020-02-26 15:14

If you know for sure that the 'value' column is always the last column:

awk -F, '{sum+=$NF} END {print sum}'

NF represents the number of fields, so $NF is the last column.
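Assuming the sample rows are saved in a file called file, this sidesteps the embedded comma because the extra comma only shifts the middle fields, and the non-numeric header field just adds 0 to the sum:

$ awk -F, '{sum+=$NF} END {print sum}' file
192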

走好不送
#4 · 2020-02-26 15:15

You're probably better off doing it in Perl with Text::CSV, since that's a fast and robust solution.
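A minimal sketch of how that might look for the OP's sum (not from this answer; the file name file.csv and the allow_whitespace option are my assumptions):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# allow_whitespace strips the blanks that follow the commas in the sample data
my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1 })
    or die Text::CSV->error_diag;

open my $fh, '<', 'file.csv' or die "file.csv: $!";
$csv->getline($fh);                     # skip the header row
my $sum = 0;
while (my $row = $csv->getline($fh)) {
    $sum += $row->[2];                  # third column holds the value
}
close $fh;
print "$sum\n";                         # 192 for the sample data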

够拽才男人
#5 · 2020-02-26 15:15

For as simple an input file as that you can just write a small function to convert all of the real FSs outside of the quotes to some other value (I chose RS since the record separator cannot be part of the record) and then use that as the FS, e.g.:

$ cat decsv.awk
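# save the real field separator in fs and split fields on RS instead;
# decsv() converts the real separators outside quotes into RS, which can never occur inside a record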
BEGIN{ fs=FS; FS=RS }

{
   decsv()

   for (i=1;i<=NF;i++) {
       printf "Record %d, Field %d is <%s>\n" ,NR,i,$i
   }
   print ""
}

function decsv(         curr,head,tail)
{
   tail = $0
   while ( match(tail,/"[^"]+"/) ) {
       head = substr(tail, 1, RSTART-1);
       gsub(fs,RS,head)
       curr = curr head substr(tail, RSTART, RLENGTH)
       tail = substr(tail, RSTART + RLENGTH)
   }
   gsub(fs,RS,tail)
   $0 = curr tail
}

$ cat file
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

$ awk -F", " -f decsv.awk file
Record 1, Field 1 is <id>
Record 1, Field 2 is <name>
Record 1, Field 3 is <value>

Record 2, Field 1 is <1>
Record 2, Field 2 is <foo>
Record 2, Field 3 is <17>

Record 3, Field 1 is <2>
Record 3, Field 2 is <bar>
Record 3, Field 3 is <76>

Record 4, Field 1 is <3>
Record 4, Field 2 is <"I am the, question">
Record 4, Field 3 is <99>

It only becomes complicated when you have to deal with embedded newlines and embedded escaped quotes within the quotes, and even then it's not too hard; it's all been done before...
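To answer the original summing question with this approach, the per-field print loop can be replaced by a sum (a small variation of my own, reusing the decsv() function above; the script name sumcsv.awk is just an example):

$ cat sumcsv.awk
BEGIN { fs=FS; FS=RS }
{ decsv(); sum += $3 }
END { print sum }
# plus the decsv() function exactly as above

$ awk -F", " -f sumcsv.awk file
192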

See "What's the most robust way to efficiently parse CSV using awk?" for more information.

三岁会撩人
#6 · 2020-02-26 15:20

You can always tackle the problem at the source: put quotes around the name field, just as was done for "I am the, question". That is much easier than spending your time coding workarounds for it.

Update (as Dennis requested): a simple example

$ s='id, "name1,name2", value 1, foo, 17 2, bar, 76 3, "I am the, question", 99'

$ echo $s|awk -F'"' '{ for(i=1;i<=NF;i+=2) print $i}'
id,
, value 1, foo, 17 2, bar, 76 3,
, 99

$ echo $s|awk -F'"' '{ for(i=2;i<=NF;i+=2) print $i}'
name1,name2
I am the, question

As you can see, by setting the delimiter to the double quote, the fields that belong inside quotes always land at even-numbered positions. Since the OP doesn't have the luxury of modifying the source data, this method will not be appropriate for him.

We Are One
#7 · 2020-02-26 15:25

One way using GNU awk (4.0 or later) and FPAT, which describes the contents of a field rather than the separator between fields:

awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } { sum+=$3 } END { print sum }' file.txt

Result:

192
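A quick way to see how FPAT splits the quoted row (same file.txt as above):

$ awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } NR==4 { for (i=1; i<=NF; i++) print i, $i }' file.txt
1 3
2 "I am the, question"
3 99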