The intent of this question is to provide a canonical answer.
Given a CSV, as might be generated by Excel or other tools, with embedded newlines, embedded double quotes and empty fields like:
$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
What's the most robust and efficient way, using awk, to identify the separate records and fields:
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
so the rest of the awk script can use those records and fields internally.
A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.
The solution must tolerate the end of record just being LF (\n) as is typical for UNIX files, rather than CRLF (\r\n) as that standard requires and Excel or other Windows tools would generate. It will also tolerate unquoted fields mixed with quoted fields. It will specifically not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow - if you have that then adding a gsub(/\\"/,"\"\"") up front would handle it, and trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
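For example, a pre-processing rule like this (just a sketch, assuming the input only ever uses backslash escaping) could be placed ahead of the CSV-splitting logic so every record is normalized before anything else touches it:

{ gsub(/\\"/, "\"\"") }    # first rule of the script: turn each \" into "" before any other processing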
If your CSV cannot contain newlines or escaped double quotes then all you need is a one-liner (with GNU awk for FPAT) along the lines of the sketch below.
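A minimal sketch of such a one-liner - the FPAT value and the demo input here are my own choices, not necessarily what the original code used, and FPAT needs gawk 4.0 or later:

$ echo 'foo,"field,""with"",commas",bar' |
  awk -v FPAT='[^,]*|("[^"]*")+' '{
      # FPAT describes what a field looks like: either a run of non-commas
      # or one or more quoted strings (the + also happens to tolerate "" escapes)
      for (i=1; i<=NF; i++) printf "$%d=<%s>\n", i, $i
  }'
$1=<foo>
$2=<"field,""with"",commas">
$3=<bar>

Note that the fields keep their surrounding quotes and any doubled quotes with this approach, so you would still strip and convert those yourself if you need the decoded text.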
Otherwise, though, the more general, robust, portable solution that will work with any modern awk is the quote-counting approach sketched below.
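Here is a sketch of that approach, reconstructed from the explanation that follows rather than copied from anywhere, so treat the variable names, the trailing-comma sentinel and the special-casing of empty quoted fields as my own choices:

$ cat csvsplit.awk
{
    # Keep appending physical lines until the record seen so far contains an
    # even number of "s; an odd count means the \n we just consumed was inside
    # a quoted field, so put the RS back and read more.
    rec = rec $0
    if ( gsub(/"/, "&", rec) % 2 ) {   # gsub() with "&" leaves rec unchanged,
        rec = rec RS                   # it is only used here to count the "s
        next
    }

    # Hide any literal @s, then map every escaped quote ("") to @B so the only
    # "s left in rec are the ones that open and close quoted fields.
    gsub(/@/,  "@A", rec)
    gsub(/""/, "@B", rec)

    # Split rec into real awk fields, one field at a time.
    $0 = ""
    nf = 0
    rec = rec ","                            # sentinel: every field now ends in a ","
    while ( rec != "" ) {
        if ( substr(rec, 1, 1) == "\"" ) {   # quoted field (no "s inside it now)
            match(rec, /^"[^"]*"/)
            fld = substr(rec, 2, RLENGTH - 2)     # drop the enclosing quotes
        }
        else {                               # unquoted (possibly empty) field
            match(rec, /^[^,]*/)
            fld = substr(rec, 1, RLENGTH)
            if ( fld == "@B" ) {             # it was a complete empty quoted field, ""
                fld = ""
            }
        }
        rec = substr(rec, RLENGTH + 2)       # step past the field and its ","
        gsub(/@B/, "\"", fld)                # each "" becomes the " it represents
        gsub(/@A/, "@",  fld)                # bring the literal @s back
        $(++nf) = fld
    }

    # $1..$NF now hold the fields of one complete record - do the real work here.
    printf "Record %d:\n", ++recNr           # NR counts physical lines, hence recNr
    for (i = 1; i <= NF; i++) {
        printf "$%d=<%s>\n", i, $i
    }
    print "----"

    rec = ""
}

Running awk -f csvsplit.awk file.csv against the sample above should print the records and fields in exactly the format shown in the question.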
The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler as the "newlines" within each field will actually just be line feeds (i.e. \ns) and so you can set RS="\r\n" and then the \ns within fields will not be treated as line endings (see the short sketch below).
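For instance (a sketch; a multi-character RS needs an awk that supports it, such as gawk, since POSIX awk only honours the first character of RS, and the file name here is made up):

$ awk 'BEGIN{RS="\r\n"} {printf "record %d: <%s>\n", NR, $0}' windows-file.csv

Each $0 is then one complete CSV record even if it contains \ns, so the quote-counting accumulation is unnecessary and you can go straight to the @A/@B substitutions and the field-splitting loop.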
The general solution above works by simply counting how many "s are present so far in the current record whenever it encounters the RS - if it's an odd number then the RS (presumably \n but it doesn't have to be) is mid-field and so we keep building the current record, but if it's even then it's the end of the current record and so we can continue with the rest of the script processing the now complete record.
gsub(/@/,"@A"); gsub(/""/,"@B")
converts every pair of double quotes axcross the whole record (bear in mind these""
pairs can only apply within quoted fields) to a string@B
that does not contain a double quote so that when we split the record into fields the match() doesn't get tripped up by quotes appearing inside fields. Thegsub(/@B/,"\"",$i); gsub(/@A/,"@",$i)
restores the quotes inside each field individually and also converts the""
s to the"
s they really represent.
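To see why the literal @s have to be hidden first, here is a tiny standalone round trip of just those substitutions (applied to a whole string purely for illustration - in the script above the restore runs per field):

$ awk 'BEGIN {
    rec = "\"x@B\",\"y\"\"z\""   # the raw text:          "x@B","y""z"
    gsub(/@/,  "@A", rec)        # hide literal @s:        "x@AB","y""z"
    gsub(/""/, "@B", rec)        # mark escaped quotes:    "x@AB","y@Bz"
    print rec
    gsub(/@B/, "\"", rec)        # escaped quotes -> " :   "x@AB","y"z"
    gsub(/@A/, "@",  rec)        # literal @s come back:   "x@B","y"z"
    print rec
}'
"x@AB","y@Bz"
"x@B","y"z"

The pre-existing @B in the data survives untouched because its @ was turned into @A before the "" substitution ran.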