Count the number of columns in a pipe-delimited file

Posted 2019-06-25 23:04

Question:

I have a pipe | delimited file.

File:

106232145|"medicare"|"medicare,medicaid"|789

I would like to count the number of fields in each line. I tried the code below.

Code:

awk -F '|' '{print NF-1}'

This returns 5 instead of 4, because awk treats "medicare|medicaid" as two separate fields instead of one.

Answer 1:

awk -F\| '{print NF}'

gives the correct result.
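
As a quick check of the difference between NF and NF-1, here is a small sketch run against the sample line posted in the question. The variable names (line, fields, seps) are only illustrative, and it assumes an awk that treats a single-character -F value literally, as mawk and gawk do.

```shell
#!/bin/sh
# The sample line from the question, held in an illustrative variable.
line='106232145|"medicare"|"medicare,medicaid"|789'

# NF is the number of fields on the line.
fields=$(printf '%s\n' "$line" | awk -F'|' '{ print NF }')

# NF-1 is the number of separators, which is one less than the field count.
seps=$(printf '%s\n' "$line" | awk -F'|' '{ print NF - 1 }')

echo "fields=$fields separators=$seps"
```

On that line the sketch reports 4 fields and 3 separators. Note that this only holds while no quoted field contains a pipe.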



Answer 2:

Pure Unix solution (without awk/Perl):

$ cat  /tmp/x1
1|2|3|34
4534|23442|1121|334434

$ head -1 /tmp/x1 | tr "|" "\012" | wc -l
4

Perl solution - 1-liner:

$ perl5.8 -naF'\|' -e 'print scalar(@F)."\n";exit;' /tmp/x1
4

BUT!!!! IMPORTANT!!!

NONE of these solutions (nor those in the other answers) work 100% of the time!

Namely, they all break when it's a REAL "pipe-separated" file, with a pipe being a valid character in the field (and the field being quoted), the way real CSV files work.

E.g.

$ cat /tmp/x2
"0|1"|2|3|34
4534|23442|1121|334434
$ perl5.8 -naF'\|' -e 'print scalar(@F)."\n";exit;' /tmp/x2
5   <----- BROKEN!!! There are only 4 fields, first field is "0|1"

To fix that, a proper CSV (or delimited file) parser should be used, such as one in Perl:

$ perl5.8 -MText::CSV_XS \
-ne '$csv=Text::CSV_XS->new({sep_char => "|"});  $csv->parse($_);
@f = $csv->fields(); print scalar(@f); print "\n"; exit;' /tmp/x2

This prints the correct value:

4

As a note, simply fixing an awk or sed solution with a convoluted RegEx won't work easily, since on top of pipe-containing-and-quoted PSV fields, the spec also allows quotes as part of the field as well. That does NOT lend itself to a nice RegEx solution.
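
When no CSV module is available, one way to avoid the regex problem is to scan each line character by character and track whether the scanner is inside quotes. The following is a minimal sketch in POSIX awk wrapped in a shell function. The name count_fields and its argument are hypothetical and do not come from any answer here. It assumes that the double quote is the only quoting character, that a literal quote inside a quoted field is written as "", and that fields never contain newlines.

```shell
#!/bin/sh
# count_fields (hypothetical helper): read lines on stdin and print the
# number of fields on each one. The separator is the first argument and
# defaults to "|". A separator inside a double-quoted field is treated as data.
count_fields() {
    sep="${1:-|}"
    awk -v sep="$sep" '
    {
        n = 1      # every line has at least one field
        inq = 0    # 1 while the scanner is inside a quoted field
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\"")
                inq = !inq    # a doubled quote ("") toggles twice, so the state is unchanged
            else if (c == sep && !inq)
                n++           # only separators outside quotes end a field
        }
        print n
    }'
}
```

For example, count_fields '|' < /tmp/x2 should report 4 for both lines of the file shown above. Anything beyond the stated assumptions (embedded newlines, other quoting rules) is still better handled by a real parser such as Text::CSV_XS.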



Answer 3:

$ cat fieldparse.awk
#NR > 1 { print "--"; }

# Uncomment printf/print in the for loops to see
#   each field on a separate line as well as the commented line above (to show that it works).
{
    nfields = 0;
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^".*[^"]$/)
            for (; i <= NF && ($i !~ /.*"$/); i++) {
                #printf("%s%s", $i, FS);
            }
        #print $i;
        nfields++;
    }
    print nfields;
    if (FILENAME == "-")
        FILENAME = "(standard input)";
    filenames[FILENAME] = sprintf("%d %d", FNR, nfields);
}

END {
    print NR, "total records processed";
    for (f in filenames) {
        split(filenames[f], fn, " ");
        printf("\t* %s: %d records with %d fields\n", f, fn[1], fn[2]);
    }
}

$ awk -F'|' -f fieldparse.awk demo.txt

It works for any single character separator that is NOT a double quotation mark, meaning standard tab delimited, CSV, etc. formats (as standard as they get anyway...)

The output format is merely illustrative and a bit decorative at the end, but the content is still useful IMHO, such as handling multiple files. In any case, I hope it helps! :-)

Edit

This was tested with mawk and GNU awk (gawk); gawk was tested in its traditional, POSIX, and default modes. With the comments and output statements trimmed, it is actually a small program, though not as small as one might like.



Answer 4:

For a | delimited file with | embedded inside quoted fields, this should work with GNU awk v4.0 or later:

gawk '{ print NF }' FPAT="([^|]+)|(\"[^\"]+\")"


Answer 5:

perl -ne 'print scalar( split( /\|/, $_ ) ) . "\n"' [filename]