Identifying and removing null characters in UNIX

Posted 2019-01-06 09:45

Question:

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols, interleaved in normal text. How can I:

  1. Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.

  2. Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?

Answer 1:

I’d use tr:

tr < file-with-nulls -d '\000' > file-without-nulls

If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
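
For comparison, the same deletion can be sketched in Python. This is a minimal illustration rather than a replacement for tr; the function names strip_nuls and strip_nuls_file and the 1 MiB chunk size are assumptions made for this example.

```python
def strip_nuls(data: bytes) -> bytes:
    """Return data with every NUL byte (0x00) removed."""
    return data.replace(b"\x00", b"")


def strip_nuls_file(src: str, dst: str, chunk_size: int = 1 << 20) -> int:
    """Copy src to dst without NUL bytes; return how many were dropped."""
    removed = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            cleaned = strip_nuls(chunk)
            removed += len(chunk) - len(cleaned)
            fout.write(cleaned)
    return removed
```

Calling strip_nuls_file("file-with-nulls", "file-without-nulls") mirrors the tr command above. Working in binary mode avoids any guess about the text encoding.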



Answer 2:

Use the following sed command for removing the null characters in a file.

sed -i 's/\x0//g' null.txt

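This solution edits the file in place. Note that GNU sed does this by writing a temporary file and renaming it over the original, so a process that already has the file open keeps reading the old contents. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.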
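
The write-then-rename pattern behind an in-place edit can be sketched in Python as follows. This is an illustration under assumptions, not sed's actual implementation: the function name strip_nuls_in_place and the ".tmp" temporary name are hypothetical.

```python
import os
import shutil


def strip_nuls_in_place(path: str, backup_suffix: str = "") -> None:
    """Rewrite path without NUL bytes, optionally keeping a backup copy."""
    if backup_suffix:
        # Keep the original under path + suffix, similar to sed -i'ext'.
        shutil.copy2(path, path + backup_suffix)
    tmp = path + ".tmp"  # hypothetical temporary name in the same directory
    with open(path, "rb") as fin, open(tmp, "wb") as fout:
        for line in fin:
            fout.write(line.replace(b"\x00", b""))
    # Rename the cleaned copy over the original.
    os.replace(tmp, path)
```

Because the cleaned data lands in a new file that is then renamed, readers that opened the old file earlier are not affected by the rewrite.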



Answer 3:

A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
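
A rough way to check for this case before converting is sketched below in Python. The 0.3 threshold, the function names, and the fallback to little-endian when no byte order mark is present are all assumptions made for illustration; a real file may need a different guess.

```python
def looks_like_utf16(data: bytes, threshold: float = 0.3) -> bool:
    """Heuristic: a UTF-16 BOM, or NULs making up a large share of the bytes."""
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return True
    return len(data) > 0 and data.count(b"\x00") / len(data) >= threshold


def utf16_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 input and re-encode it as UTF-8."""
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = data.decode("utf-16")  # the codec reads and strips the BOM
    else:
        text = data.decode("utf-16-le")  # assumption: little-endian without BOM
    return text.encode("utf-8")
```

If looks_like_utf16 returns True for the file's bytes, converting is likely a better fix than deleting the NULs, since deletion would corrupt any non-ASCII characters.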



Answer 4:

I discovered the following, which prints out which lines, if any, have null characters:

perl -ne '/\000/ and print;' file-with-nulls

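Also, an octal dump can tell you if there are nulls. Use -b so that od prints one byte per field; by default it prints two-byte words, and ' 000' would then not match NUL bytes reliably:

od -b file-with-nulls | grep ' 000'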
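
To get line numbers rather than the lines themselves, a small Python sketch along the same lines could look like this. The function name lines_with_nuls is an assumption for this example.

```python
def lines_with_nuls(path: str):
    """Yield (line_number, nul_count) for each line that contains a NUL byte."""
    with open(path, "rb") as f:
        # Binary mode splits lines on b"\n" and leaves NUL bytes untouched.
        for number, line in enumerate(f, start=1):
            count = line.count(b"\x00")
            if count:
                yield number, count
```

Iterating over lines_with_nuls("file-with-nulls") and printing each pair lists every affected line with the number of NUL bytes it holds.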


Answer 5:

If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.

tr -d '\n\000' <infile | tr '\r' '\n' >outfile
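
The same two steps can be done in a single pass in Python with bytes.translate, which first drops the bytes listed for deletion and then maps the remaining ones. The function name fix_crlf_nul is an assumption for this sketch.

```python
# Map carriage return to newline; every other byte maps to itself.
CR_TO_LF = bytes.maketrans(b"\r", b"\n")


def fix_crlf_nul(data: bytes) -> bytes:
    """Turn lines ending in \\r\\n\\000 into lines ending in \\n."""
    # Delete the original \n and NUL bytes, then translate \r to \n.
    return data.translate(CR_TO_LF, b"\n\x00")
```

Like the tr pipeline, this also removes NUL bytes and newlines that occur elsewhere in the input, so it only suits files whose lines really end in that three-byte sequence.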


Answer 6:

Here is an example of how to remove NUL characters using ex (in place):

ex -s +"%s/\%x00//g" -cwq nulls.txt

and for multiple files:

ex -s +'bufdo!%s/\%x00//g' -cxa *.txt

To recurse into subdirectories, you can use the glob pattern **/*.txt (if your shell supports it).

This is useful for scripting, since sed's -i option is a non-standard extension whose syntax differs between GNU and BSD sed.

See also: How to check if the file is a binary file and read all the files which are not?



Answer 7:

I used:

recode UTF-16..UTF-8 <filename>

to get rid of the zeroes in the file.



Answer 8:

I faced the same error with:

import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')

I solved the problem by changing the encoding to utf-16:

f=cd.open(filePath,'r','utf-16')