I looked around, but I can't find anyone who has already answered this.
I am writing a bash script which will read 6 different csv
files, and count how many lines in all the files together have certain tags in them.
(it's a contact list database - and there are tags for business or private email address)
Here is an example of the code I have written:
### First Scan - Business emails ###
bus="$(awk 'BEGIN {FS = ","}{print $27}' FILE*full* | grep -c "Business")"
echo "No. of Business Accounts: $bus"
### Second Scan - Private emails ###
priv="$(awk 'BEGIN {FS = ","}{print $27}' FILE*full* | grep -c "Private")"
echo "No. of Private Accounts: $priv"
The script returns values which seem perfectly correct. HOWEVER! I know for a fact that every line in every file HAS the tag 'business' or 'private' in the same position - and there are no empty lines but when I add the two results together, it does not equal the full number of lines... There are about 45000 missing...
Are there any limitations to the stdin
for grep
or awk
- the database in full is over 2 million lines long...
Please help! :-)
The most likely is that some of those millions of lines of csv include a quoted field with a comma in it. Awk knows nothing about quotes; it will just split on commas.
If you are using Gnu awk, you can use the
FPAT
variable, which lets you specify a regular expression for the fields rather than a regular expression for the field separators. For example, this will work for many CSV files (aside from the line end issue if the csv file uses CR-LF line endings). (-v var=value
is roughly equivalent toBEGIN{var="value"}
, and not just in Gnu awk.)By the way, there's no need to use
grep
as well asawk
. You can filter and count with awk; in fact, you can do both counts in the same scan:The regular expression above is quite simple, and it will not deal with "incorrect" CSV files (if you can use that word for such a loosely defined format). It matches:
So the first alternative will match unquoted fields, like
93.7
orPrivate
, and the second alternative will match:Quoted fields, possibly including commas:
"Blood, sweat and tears"
Quoted fields with internal quotes as per the quote-doubling rule:
"""My goodness,"" she said"
(See section 2.7 of RFC 4180.)It doesn't attempt to match backslash-escaped quotes, which are not part of the standard (nor are they produced by MS Excel, afaik), and it will fail completely if a quoted field erroneously includes an undoubled quote.
You can use a simple variant on the above program to view the lines which are not being parsed correctly, which might let you fix them, or adapt the regular expression, something like:
Try this and tell us what output you get:
NR in the output above represents the total records and should equal bus + priv + other.