I have some csv files which are broken since there are junk such as control characters, enters and delimiters in some of the fields. An example mockup data without control characters:
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut
in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut
and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut
in
three;data 38
9;data 19;data 29;data 39
I am processing above crap with awk:
BEGIN { FS=OFS=";" } # delimiters
NR==1 { nf=NF; } # header record is fine, use the NF
NR>1 {
if(NF<nf) { # if NF less that header's NF
prev=$0 # store $0
if(getline==1) { # read the "next" line
succ=$0 # set the "next" line to succ
$0=prev succ # rebuild a current record
}
}
if(NF!=nf) # if NF is still not adequate
$0=succ # expect original line to be malformed
if(NF!=nf) # if the "next" line was malformed as well
next # well skip "next" line and move to next
} 1
Naturally above program will fail records 4
and 6
(as the actual data has several fields where the extra delimiter may lurk) and 8
(since I only read the next line if NF
is too short. I can live with loosing 4
and 6
but 8
might be doable?
Also, three successive if
s scream for a for
loop but it's Friday afternoon here and my day is nearing $
and I just can't spin my head around it anymore. Do you guys have any brain reserve left I could borrow? Any best practices I didn't think of?