I have a pipe (|) delimited file.
File:
106232145|"medicare"|"medicare,medicaid"|789
I would like to count the number of fields in each line. I tried the code below.
Code:
awk -F '|' '{print NF-1}'
This returns 5 instead of 4, because awk treats "medicare|medicaid" as two separate fields instead of one.
Pure Unix solution (without awk/Perl):
$ cat /tmp/x1
1|2|3|34
4534|23442|1121|334434
$ head -1 /tmp/x1 | tr "|" "\012" | wc -l
4
Perl solution - 1-liner:
$ perl5.8 -naF'\|' -e 'print scalar(@F)."\n";exit;' /tmp/x1
4
BUT!!!! IMPORTANT!!!
Every one of these solutions, as well as those in the other answers, does NOT work 100% of the time!
Namely, they all break on a REAL "pipe-separated" file, where a pipe is a valid character inside a quoted field, the way real CSV files work.
E.g.
$ cat /tmp/x2
"0|1"|2|3|34
4534|23442|1121|334434
$ perl5.8 -naF'\|' -e 'print scalar(@F)."\n";exit;' /tmp/x2
5 <----- BROKEN!!! There are only 4 fields, first field is "0|1"
To fix that, a proper CSV (or delimited file) parser should be used, such as one in Perl:
$ perl5.8 -MText::CSV_XS -ne '$csv=Text::CSV_XS->new({sep_char => "|"}); $csv->parse($_);
@f = $csv->fields(); print scalar(@f), "\n"; exit;' /tmp/x2
Prints the correct value:
4
As a note, simply fixing an awk or sed solution with a convoluted RegEx won't work easily, since on top of pipe-containing-and-quoted PSV fields, the spec also allows quotes as part of the field. That does NOT lend itself to a nice RegEx solution.
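If a CSV module isn't available, a small character-by-character scan is one way around the RegEx problem. The following is only a sketch, not a full parser: count_fields is a made-up helper name, and it assumes that quotes inside a field are doubled ("") and that no quoted field spans more than one line. It counts the delimiters that sit outside double quotes and adds one.

```shell
# count_fields: hypothetical helper (not from the answers above).
# Prints the number of fields on each input line, treating a delimiter
# inside double quotes as data.
# Assumptions: single-character delimiter, quotes inside a field are
# doubled (""), and no quoted field spans more than one line.
count_fields() {
    awk -v sep="${1:-|}" '
    {
        n = 1       # every line has at least one field
        inq = 0     # 1 while the scan is inside a quoted field
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\"")
                inq = !inq      # a doubled "" toggles twice, so the state is unchanged
            else if (c == sep && !inq)
                n++             # only delimiters outside quotes separate fields
        }
        print n
    }'
}

# Usage: the two lines of /tmp/x2 from above
printf '%s\n' '"0|1"|2|3|34' '4534|23442|1121|334434' | count_fields '|'
```

Note that an empty line is reported as 1 field, and a line with an unbalanced quote will be miscounted, so for anything beyond a quick check a real parser is still the safer bet.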
$ cat fieldparse.awk
#NR > 1 { print "--"; }
# Uncomment printf/print in the for loops to see
# each field on a separate line as well as the commented line above (to show that it works).
{
    nfields = 0;
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^".*[^"]$/)
            for (; i <= NF && ($i !~ /.*"$/); i++) {
                #printf("%s%s", $i, FS);
            }
        #print $i;
        nfields++;
    }
    print nfields;
    if (FILENAME == "-")
        FILENAME = "(standard input)";
    filenames[FILENAME] = sprintf("%d %d", FNR, nfields);
}
END {
    print NR, "total records processed";
    for (f in filenames) {
        split(filenames[f], fn, " ");
        printf("\t* %s: %d records with %d fields\n", f, fn[1], fn[2]);
    }
}
$ awk -F'|' -f fieldparse.awk demo.txt
It works for any single character separator that is NOT a double quotation mark, meaning standard tab delimited, CSV, etc. formats (as standard as they get anyway...)
The output format is merely illustrative and a bit decorative at the end, but the content is still useful IMHO, such as handling multiple files. In any case, I hope it helps! :-)
Edit
This was tested using mawk and GNU awk (gawk), the latter of which was tested in traditional, POSIX and the default modes. With the comments and output statements trimmed, it is actually a small program, though it isn't as small as one might like.
For a | delimited file with embedded | in between, this should work with GNU awk v4.0 or later:
gawk '{ print NF }' FPAT="([^|]+)|(\"[^\"]+\")"
perl -ne 'print scalar( split( /\|/, $_ ) ) . "\n"' [filename]