grep/awk stdin Limitations?

2019-07-31 13:46发布

I looked around, but I can't find anyone who has already answered this.

I am writing a bash script which will read 6 different csv files, and count how many lines in all the files together have certain tags in them.

(it's a contact list database - and there are tags for business or private email address)

Here is an example of the code I have written:

### First Scan - Business emails ###

bus="$(awk 'BEGIN {FS = ","}{print $27}' FILE*full* | grep -c "Business")"

echo "No. of Business Accounts: $bus"

### Second Scan - Private emails ###

priv="$(awk 'BEGIN {FS = ","}{print $27}' FILE*full* | grep -c "Private")"

echo "No. of Private Accounts: $priv"

The script returns values which seem perfectly correct. HOWEVER! I know for a fact that every line in every file HAS the tag 'business' or 'private' in the same position - and there are no empty lines but when I add the two results together, it does not equal the full number of lines... There are about 45000 missing...

Are there any limitations to the stdin for grep or awk - the database in full is over 2 million lines long...

Please help! :-)

2条回答
三岁会撩人
2楼-- · 2019-07-31 14:23

The most likely is that some of those millions of lines of csv include a quoted field with a comma in it. Awk knows nothing about quotes; it will just split on commas.

If you are using Gnu awk, you can use the FPAT variable, which lets you specify a regular expression for the fields rather than a regular expression for the field separators. For example, this will work for many CSV files (aside from the line end issue if the csv file uses CR-LF line endings). (-v var=value is roughly equivalent to BEGIN{var="value"}, and not just in Gnu awk.)

gawk -v FPAT='[^",][^,]*|("[^"]*")*'  

By the way, there's no need to use grep as well as awk. You can filter and count with awk; in fact, you can do both counts in the same scan:

gawk -v FPAT='[^",][^,]*|("[^"]*")*' '
     $27 ~ /Business/ {++bus}
     $27 ~ /Private/  {++pri}
     END { print "No. of Business accounts", bus
           print "No. of Private accounts", pri}' FILE*full* 

The regular expression above is quite simple, and it will not deal with "incorrect" CSV files (if you can use that word for such a loosely defined format). It matches:

[^",][^,]*|("[^"]*")*
  |    |  | |  |  | |
  +----+--+-+--+--+-+----- A character other than quote or comma
       |  | |  |  | |
       +--+-+--+--+-+----- Followed by any number of characters other than comma
          | |  |  | |
          +-+--+--+-+--- OR
            |  |  | |
            |  |  | +----- Any number of sequences consisting of
            |  |  |
            +--+--+--------- A quote
               |  |
               +--+--------- Any number of characters other than a quote
                  |
                  +--------- Another quote

So the first alternative will match unquoted fields, like 93.7 or Private, and the second alternative will match:

  • Quoted fields, possibly including commas: "Blood, sweat and tears"

  • Quoted fields with internal quotes as per the quote-doubling rule: """My goodness,"" she said" (See section 2.7 of RFC 4180.)

It doesn't attempt to match backslash-escaped quotes, which are not part of the standard (nor are they produced by MS Excel, afaik), and it will fail completely if a quoted field erroneously includes an undoubled quote.

You can use a simple variant on the above program to view the lines which are not being parsed correctly, which might let you fix them, or adapt the regular expression, something like:

gawk -v FPAT='[^",][^,]*|("[^"]*")*' '
     $27 !~ /Business/ && $27 !~ /Private/ {
           print "----"
           print "Error at line " NR:
           print $0
           for (i=1; i<=NF; ++i) printf "%2d: |%s|\n", i, $i
     }' filename
查看更多
Root(大扎)
3楼-- · 2019-07-31 14:32

Try this and tell us what output you get:

awk -F',' '
$27 ~ /Business/ { bus++; next }
$27 ~ /Private/  { priv++; next }
{ other++; print "Non-Business/Private:", FILENAME, FNR, $27 }
END { print NR, bus, priv, other }
' FILE*full*

NR in the output above represents the total records and should equal bus + priv + other.

查看更多
登录 后发表回答