An odd text stops awk command from working

Posted 2019-07-31 13:34

Question:

I use an awk command to count lines that share the same beginning.

For instance, in try1.txt, the texts are:

b : c
b : c

When I launch the following command in a terminal:

awk -F ' : ' '$1=="b"{a[$2]++} END{for (i in a) print "  ", i,a[i]}' try1.txt

it returns c 2, which is correct, because b : c appears twice in try1.txt.

The output of my tool is a huge output.txt, much more complicated than try1.txt. Some part of output.txt contains the following characters:

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^137

The system writes it whenever a process is killed, and I am OK with that. However, I found that it stops awk from working properly. For example, take try2.txt as follows:

b : c
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^137
b : c

The command awk -F ' : ' '$1=="b"{a[$2]++} END{for (i in a) print " ", i,a[i]}' try2.txt returns c 1. That is, it stopped when it reached the odd line ^@^@^@^@^@.

I don't know how to keep the system from writing the odd line ^@^@^@^@^@, so does anyone know how to amend the awk command to work around it?

Edit: It seems that the ^@ in my output.txt is not the ordinary two-character text ^@. [Screenshot: the affected part of output.txt, displayed in Emacs]

Edit: As suggested, I ran xxd try2.txt, which gave:

0000000: 6220 3a20 630a 0000 0000 0000 0000 0000  b : c...........
0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000020: 0000 0000 0000 0000 0031 3337 0a62 203a  .........137.b :
0000030: 2063 0a  

Answer 1:

^@ is likely a representation of a binary 0 / NUL character:

$ head -c10 /dev/zero > 10zero
$ cat -v 10zero 
^@^@^@^@^@^@^@^@^@^@$ 

Some text-oriented utilities may treat this as an end of file.
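To check whether a file really contains NUL bytes before blaming awk, you can count them. The following is a minimal sketch; the helper name count_nuls and the file name nul_sample.txt are made up for illustration, and it assumes a head that supports -c, as in the example above.

```shell
# Count NUL bytes in a file: delete everything except NUL, then count what is left.
count_nuls() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Hypothetical sample: one text line, ten NUL bytes, then another text line.
{ printf 'b : c\n'; head -c 10 /dev/zero; printf '\nb : c\n'; } > nul_sample.txt

count_nuls nul_sample.txt   # prints 10
```

A result of 0 would mean the ^@ shown by the editor are literal text rather than NUL bytes.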

Since your input is effectively a binary file, you should have more luck extracting the text strings from it first and operating on just those:

$ strings try1.txt | awk -F ' : ' '$1=="b"{a[$2]++} END{for (i in a) print "  ", i,a[i]}'
   c 2
$

See the strings command man page for details. (By the way, watch out when you google "man strings": you might get some images you had not bargained for ;-) )


Note for the curious - I recreated the OP's try1.txt file exactly on my machine thus:

  • capture the xxd output on the question to a text file called try1.xxd
  • xxd -r try1.xxd > try1.txt reverses the normal xxd operation.
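If xxd is not available, roughly the same file can be rebuilt by hand from the byte counts in the dump: 6 bytes of text, 35 NUL bytes, then "137" and the final line. This is only a sketch; the file name try_recreated.txt is made up for illustration.

```shell
# Rebuild the dumped file: "b : c\n", 35 NUL bytes, "137\n", "b : c\n".
{
    printf 'b : c\n'
    head -c 35 /dev/zero
    printf '137\nb : c\n'
} > try_recreated.txt

# Sanity checks: total size and number of NUL bytes.
total=$(wc -c < try_recreated.txt | tr -d ' ')
nuls=$(tr -cd '\000' < try_recreated.txt | wc -c | tr -d ' ')
echo "bytes=$total nuls=$nuls"   # prints bytes=51 nuls=35
```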


Answer 2:

Many Awk implementations, and Unix text-processing tools in general, handle null (zero) bytes poorly, because a null byte is the string terminator in C, and the C library string functions these tools are built on stop at the first one.

Perl was designed to work with arbitrary inputs; you can try a2p to convert your Awk script to Perl (but don't expect idiomatic, maintainable, or efficient Perl). Note that a2p was removed from the core Perl distribution in 5.22 and is now distributed separately on CPAN.

Or try this:

perl -lne '$a{$1}++ if (/^b : (.*?)\s*$/);  
    END { for $i (keys %a) { print " ", $i, " ", $a{$i} } }' try1.txt


Answer 3:

If all of the lines you want contain a :, you can try putting $0 ~ /:/ as a selector. Here's your new and improved awk statement (I wrote it on separate lines because it's easier for me to keep track of the curly braces):

$ awk -F ' : ' '
{
    if ( $0 ~ /:/ && $1 == "b" ) {
        a[$2]++
    }
}
END {
    for (i in a) {
        print "  ", i, a[i]
    }
}' try.txt

This worked as long as the ^@ were on their own line. If not, you have to find out what type of character ^@ is. I suspect it's a null character. If so, you may have to remove them from your file:

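$ tr -d '\0' < try.txt > try2.txt

The quotes matter: without them the shell strips the backslash and tr deletes the digit 0 instead of the null character. This should remove those bothersome characters. Then, use try2.txt for input.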
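The cleanup can also be done on the fly, without an intermediate file. A minimal sketch, assuming the odd characters really are NUL bytes; the file name nul_input.txt is made up for illustration and its contents mimic the dump in the question.

```shell
# Hypothetical input resembling the question's file: text, 35 NUL bytes, "137", text.
{ printf 'b : c\n'; head -c 35 /dev/zero; printf '137\nb : c\n'; } > nul_input.txt

# Strip NUL bytes and pipe the cleaned stream straight into the original awk program.
# '\000' is quoted so the shell passes the escape to tr unchanged.
result=$(tr -d '\000' < nul_input.txt |
    awk -F ' : ' '$1=="b"{a[$2]++} END{for (i in a) print "  ", i, a[i]}')

printf '%s\n' "$result"   # prints "   c 2"
```

Because awk never sees the NUL bytes, this should behave the same regardless of how a particular awk implementation handles them.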



Tags: bash shell awk