Can awk deal with a CSV file that contains commas inside a quoted field?

Posted 2020-02-26 14:31

I am using awk to compute the sum of one column in a CSV file. The data format is something like:

id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

I was using this awk script to compute the sum:

awk -F, '{sum+=$3} END {print sum}'

Some of the values in the name field contain commas, and this breaks my awk script. My question is: can awk handle this problem? If yes, how can I do that?
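For example, on the quoted row the embedded comma shifts the fields, so $3 is no longer the value:

$ echo '3, "I am the, question", 99' | awk -F, '{ print NF; print $3 }'
4
 question"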

Thank you.

11 answers
We Are One
#2 · 2020-02-26 15:12

This article helped me solve the same data-field issue. Most CSV writers put quotes around fields that contain spaces or commas, and that throws off awk's field count unless you filter them out.

If you need the data inside those quoted fields, this is not for you. ghostdog74 provided the answer, which empties such fields but keeps the total field count, which is key to keeping the data output consistent. I did not like how that solution introduced new lines, so this is the version of it I used. The first three fields never had this problem in my data. The fourth field, containing the customer name, often did, but I needed that data. The remaining fields that exhibited the problem I could throw away without issue because they were not needed in my report output. So I first sed out the 4th field's garbage very specifically and remove the first two instances of quotes. Then I apply what ghostdog74 gave to empty the remaining fields that have commas within them; this also removes the quotes, but I use printf to keep the data in a single record. I start off with 85 fields and end up with 85 fields in all cases across my 8000+ lines of messy data. A perfect score!

grep -i $1 $dbfile | sed 's/\, Inc.//;s/, LLC.//;s/, LLC//;s/, Ltd.//;s/\"//;s/\"//' | awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}' > $tmpfile

The solution that empties the fields containing commas while still maintaining the record is, of course:

awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}

Megs of thanks to ghostdog74 for the great solution!

NetsGuy256/

【Aperson】
#3 · 2020-02-26 15:14

If you know for sure that the 'value' column is always the last column:

awk -F, '{sum+=$NF} END {print sum}'

NF represents the number of fields, so $NF is the last column.
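Assuming the sample rows are saved in a file called file, this sidesteps the embedded comma because the extra comma only shifts the middle fields, and the non-numeric header field just adds 0 to the sum:

$ awk -F, '{sum+=$NF} END {print sum}' file
192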

走好不送
#4 · 2020-02-26 15:15

You're probably better off doing it in Perl with Text::CSV, since that's a fast and robust solution.
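A minimal sketch of how that might look for the OP's sum (not from this answer; the file name file.csv and the allow_whitespace option are my assumptions):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# allow_whitespace strips the blanks that follow the commas in the sample data
my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1 })
    or die Text::CSV->error_diag;

open my $fh, '<', 'file.csv' or die "file.csv: $!";
$csv->getline($fh);                     # skip the header row
my $sum = 0;
while (my $row = $csv->getline($fh)) {
    $sum += $row->[2];                  # third column holds the value
}
close $fh;
print "$sum\n";                         # 192 for the sample data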

够拽才男人
#5 · 2020-02-26 15:15

For as simple an input file as that you can just write a small function to convert all of the real FSs outside of the quotes to some other value (I chose RS since the record separator cannot be part of the record) and then use that as the FS, e.g.:

$ cat decsv.awk
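# save the real field separator in fs and split fields on RS instead;
# decsv() converts the real separators outside quotes into RS, which can never occur inside a record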
BEGIN{ fs=FS; FS=RS }

{
   decsv()

   for (i=1;i<=NF;i++) {
       printf "Record %d, Field %d is <%s>\n" ,NR,i,$i
   }
   print ""
}

function decsv(         curr,head,tail)
{
   tail = $0
   while ( match(tail,/"[^"]+"/) ) {
       head = substr(tail, 1, RSTART-1);
       gsub(fs,RS,head)
       curr = curr head substr(tail, RSTART, RLENGTH)
       tail = substr(tail, RSTART + RLENGTH)
   }
   gsub(fs,RS,tail)
   $0 = curr tail
}

$ cat file
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

$ awk -F", " -f decsv.awk file
Record 1, Field 1 is <id>
Record 1, Field 2 is <name>
Record 1, Field 3 is <value>

Record 2, Field 1 is <1>
Record 2, Field 2 is <foo>
Record 2, Field 3 is <17>

Record 3, Field 1 is <2>
Record 3, Field 2 is <bar>
Record 3, Field 3 is <76>

Record 4, Field 1 is <3>
Record 4, Field 2 is <"I am the, question">
Record 4, Field 3 is <99>

It only becomes complicated when you have to deal with embedded newlines and embedded escaped quotes within the quotes, and even then it's not too hard; it's all been done before...
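To answer the original summing question with this approach, the per-field print loop can be replaced by a sum (a small variation of my own, reusing the decsv() function above; the script name sumcsv.awk is just an example):

$ cat sumcsv.awk
BEGIN { fs=FS; FS=RS }
{ decsv(); sum += $3 }
END { print sum }
# plus the decsv() function exactly as above

$ awk -F", " -f sumcsv.awk file
192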

See "What's the most robust way to efficiently parse CSV using awk?" for more information.

三岁会撩人
#6 · 2020-02-26 15:20

You can always tackle the problem at the source: put quotes around the name field, just as was done for "I am the, question". That is much easier than spending your time coding workarounds for it.

Update (as Dennis requested): a simple example

$ s='id, "name1,name2", value 1, foo, 17 2, bar, 76 3, "I am the, question", 99'

$ echo $s|awk -F'"' '{ for(i=1;i<=NF;i+=2) print $i}'
id,
, value 1, foo, 17 2, bar, 76 3,
, 99

$ echo $s|awk -F'"' '{ for(i=2;i<=NF;i+=2) print $i}'
name1,name2
I am the, question

As you can see, by setting the delimiter to the double quote, the fields that belong inside quotes always land at even-numbered positions. Since the OP doesn't have the luxury of modifying the source data, this method will not be appropriate for him.

We Are One
#7 · 2020-02-26 15:25

One way using GNU awk (4.0 or later) and FPAT, which describes the contents of a field rather than the separator between fields:

awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } { sum+=$3 } END { print sum }' file.txt

Result:

192
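A quick way to see how FPAT splits the quoted row (same file.txt as above):

$ awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } NR==4 { for (i=1; i<=NF; i++) print i, $i }' file.txt
1 3
2 "I am the, question"
3 99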