I use the `awk` command to count lines with the same beginning. For instance, `try1.txt` contains:
b : c
b : c
When I launch the following command in a terminal:
awk -F ' : ' '$1=="b"{a[$2]++} END{for (i in a) print " ", i,a[i]}' try1.txt
it returns `c 2`, which is good, because `b : c` appears twice in `try1.txt`.
The output of my tool is a huge `output.txt`, much more complicated than `try1.txt`. Some parts of `output.txt` contain the following characters:
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^137
It is systematically written by the system when a process is killed. I am OK with that. However, I realize that it stops `awk` from working properly. For example, take `try2.txt` as follows:
b : c
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^137
b : c
The command `awk -F ' : ' '$1=="b"{a[$2]++} END{for (i in a) print " ", i,a[i]}' try2.txt` returns `c 1`. In other words, it stopped when it met the odd line `^@^@^@^@^@`.
I don't know how to keep the system from writing the odd line `^@^@^@^@^@`, so does anyone know how to amend the `awk` command to work around it?
Edit: It seems that the `^@` I found in my `output.txt` does not consist of the ordinary characters `^@`. The following is part of a screenshot of `output.txt`, displayed in Emacs, which has trouble:
Edit: As suggested, I ran `xxd try2.txt`, which gave:
0000000: 6220 3a20 630a 0000 0000 0000 0000 0000 b : c...........
0000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000020: 0000 0000 0000 0000 0031 3337 0a62 203a .........137.b :
0000030: 2063 0a
`^@` is likely a representation of a binary 0 / NUL character. Some text-oriented utilities may treat this as an end of file.

Since your input file is effectively a binary file, you should have more luck extracting the text strings from it first and just operating on those:
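A minimal sketch of that idea, assuming the `strings` utility is installed and that every line of interest is at least 4 printable characters long (its usual default minimum). The file name `nul_sample.txt` is made up for this illustration; the `awk` program is the one from the question.

```shell
# Build a small sample shaped like try2.txt: text, a run of NUL bytes, more text.
# (nul_sample.txt is a hypothetical name used only for this sketch.)
{
  printf 'b : c\n'
  printf '%35s' '' | tr ' ' '\000'   # 35 spaces turned into 35 NUL bytes
  printf '137\nb : c\n'
} > nul_sample.txt

# strings keeps only the runs of printable characters, so awk never sees a NUL.
result=$(strings nul_sample.txt |
  awk -F ' : ' '$1=="b"{a[$2]++} END{for (i in a) print " ", i,a[i]}')
printf '%s\n' "$result"
```

With this sample, both `b : c` lines should be counted. Keep in mind that `strings` silently drops any run shorter than its minimum length; `strings -n 1` lowers that threshold if short lines matter.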
See the `strings` command man page. (By the way, watch out when you google "man strings": you might get some images you had not bargained for ;-) )

Note for the curious: I recreated the OP's `try1.txt` file exactly on my machine thus: I saved the `xxd` output from the question to a text file called `try1.xxd`, then ran `xxd -r try1.xxd > try1.txt`, which reverses the normal `xxd` operation.

If all of the lines you want contain a `:`, you can try putting `$0 ~ /:/` as a selector. Here's your new and improved awk statement (I wrote it on separate lines because it's easier for me to keep track of curly braces):

This worked as long as the `^@` were on their own line. If not, you have to find out what type of character `^@` is. I suspect it's a null character. If so, you may have to remove them from your file:

This should remove those bothersome characters. Then, use `try2.txt` for input.

Many Awk implementations, and Unix text-processing tools in general, handle null (zero) bytes poorly, because the null byte is the string terminator in the fundamental C libraries used to build these tools.
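As a rough sketch of the two workarounds described above (the `$0 ~ /:/` selector and stripping the null bytes), assuming the odd bytes really are NULs as the `xxd` dump suggests: `tr -d '\000'` deletes them before `awk` reads the data, so the result no longer depends on how a particular Awk treats NUL. The file name `nul_sample.txt` is invented for this illustration, and this is only one way to do it, not necessarily the commands the answers had in mind.

```shell
# Sample input shaped like try2.txt (nul_sample.txt is a hypothetical name).
{
  printf 'b : c\n'
  printf '%35s' '' | tr ' ' '\000'   # 35 spaces turned into 35 NUL bytes
  printf '137\nb : c\n'
} > nul_sample.txt

# Delete every NUL byte, then count only lines that contain a colon.
result=$(tr -d '\000' < nul_sample.txt |
  awk -F ' : ' '
    $0 ~ /:/ && $1 == "b" {
      a[$2]++
    }
    END {
      for (i in a) print " ", i, a[i]
    }')
printf '%s\n' "$result"
```

To clean the real file once instead of on every run, the `tr` output can be redirected to a new file, for example `tr -d '\000' < output.txt > cleaned.txt` (`cleaned.txt` being a placeholder name), and `awk` pointed at that file.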
Perl was designed to work with arbitrary inputs; you can try `a2p` to convert your Awk script to Perl (but don't expect idiomatic, maintainable, or efficient Perl).

Or try this: