Fixing broken CSV files using awk

Published 2020-04-21 08:40

Question:

I have some CSV files which are broken because there is junk such as control characters, line breaks, and extra delimiters in some of the fields. Example mock-up data, without the control characters:

id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut
in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut
and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut
in 
three;data 38
9;data 19;data 29;data 39
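
For the control characters themselves, a separate pre-pass can delete them before the awk stage. This is only a sketch: it assumes the stripped bytes carry no data, it keeps tabs and newlines, and the file names are placeholders:

tr -d '\000-\010\013-\037\177' < broken.csv > stripped.csv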

I am processing the above crap with this awk program:

BEGIN { FS=OFS=";" }       # delimiters
NR==1 { nf=NF }            # the header record is fine, remember its NF
NR>1 {
    if(NF<nf) {            # if NF is less than the header's NF
        prev=$0            # store $0
        if(getline==1) {   # read the "next" line
            succ=$0        # set the "next" line to succ
            $0=prev succ   # rebuild the current record
        }
    }
    if(NF!=nf)             # if NF is still not adequate,
        $0=succ            # assume the original line was malformed
    if(NF!=nf)             # if the "next" line was malformed as well,
        next               # skip it and move on to the next one
} 1

Naturally the above program will fail on records 4 and 6 (as the actual data has several fields where an extra delimiter may lurk) and on record 8 (since I only read the next line when NF is too short). I can live with losing 4 and 6, but 8 might be doable?

Also, the three successive ifs scream for a for loop, but it's Friday afternoon here, my day is nearing $, and I just can't spin my head around it anymore. Do you have any brain reserve left I could borrow? Any best practices I didn't think of?
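
For reference, the loop those ifs are reaching for could look like the sketch below: keep pulling lines while the record is still short, which also repairs record 8 (cut into three pieces). The variable name nxt is made up here, and records with extra delimiters (4 and 6) are still dropped:

BEGIN { FS=OFS=";" }
NR==1 { nf=NF }                       # take the field count from the header
NR>1 {
    while(NF<nf && (getline nxt)>0)   # record still short: pull another line
        $0=$0 " " nxt                 # join with a space standing in for the
                                      # embedded newline; awk re-splits $0
    if(NF!=nf) next                   # still short, or overshot: drop it
} 1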

Answer 1:

The key here is to keep a buffer containing the lines of a record that is not yet "complete"; once it is, print it and clear the buffer:

awk -F';' 'NF>=4 && !nf {print; next}   # normal lines are printed as-is
           {                            # otherwise,
               if (nf>0) {              # continue a "broken" record by...
                   buff=buff OFS $0     # appending to the buffer (OFS is the
                                        # default space, standing in for the
                                        # embedded newline)
                   nf+=NF-1             # and adding NF-1 to the field count
               } else {                 # new "broken" record, so...
                   buff=$0              # start the buffer
                   nf=NF                # with the fields seen so far
               }
           }
           nf>=4{                       # once the record is complete,
               print buff               # print it
               buff=""; nf=0            # and reset the variables
           }' file

Here, buff is that buffer and nf is an internal counter that tracks how many fields have been seen so far for the current record (like the nf in your attempt).

We add NF-1 when appending to the buffer (that is, from the 2nd line of a broken record onward) because a line with NF==1 does not add any field but just continues the last field of the previous line:

8;data 18;cut        # NF==3                           |
in                   # NF==1 but it just continues $3  | all together, NF==4
three;data 38        # NF==2 but $1 continues $3       |
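
That accounting can be checked on record 8 alone with a throwaway pipeline (not part of the solution):

$ printf '8;data 18;cut\nin \nthree;data 38\n' | awk -F';' '{n+=(NR>1 ? NF-1 : NF)} END{print n}'
4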

With your sample input:

$ awk -F';' 'NF>=4 && !nf {print; next} {buff=(nf>0 ? buff OFS : "") $0; nf+=(nf>0 ? NF-1 : NF)} nf>=4{print buff; buff=""; nf=0}' file
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut in  three;data 38
9;data 19;data 29;data 39
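
If hard-coding the 4 bothers you, the same buffering idea can take the field count from the header instead. A sketch (here n is the running counter and nf holds the header's width; both names are arbitrary):

awk -F';' '
NR==1        { nf=NF; print; next }   # learn the width from the header
NF>=nf && !n { print; next }          # complete records pass through
{
    buff=(n ? buff " " : "") $0       # start or continue a broken record
    n+=(n ? NF-1 : NF)                # continuation lines add NF-1 fields
}
n>=nf { print buff; buff=""; n=0 }    # record complete: flush and reset
' file

On the sample above it should produce the same output as the one-liner; only the source of the 4 changes.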


Tags: csv awk