What would an awk script (presumably a one-liner) for removing a BOM look like?
Specification:
- print every line after the first (NR > 1)
- for the first line: if it starts with #FE #FF or #FF #FE, remove those and print the rest
I know the question was directed at Unix/Linux, but I thought it would be worth mentioning a good option for the Unix-challenged (on Windows, with a UI).
I ran into the same issue on a WordPress project (a BOM was causing problems with the RSS feed and page validation), and I had to search every file in a fairly large directory tree to find the one with the BOM. I found an application called Replace Pioneer, and in it:
Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready-made search-and-replace template for this).
It was not the most elegant solution, and it did require installing a program, which is a downside. But once I figured out what was going on, it worked like a charm (and found 3 files with a BOM out of about 2300).
Not awk, but simpler:
To check for BOM:
If BOM is present you'll see:
00000000 ef bb bf ...
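A minimal sketch of this approach, assuming a UTF-8 file whose BOM is exactly three bytes. The file names are made up for illustration, and od is used for the check, so the dump layout differs slightly from the line above:

```shell
# Create a sample UTF-8 file that starts with a BOM
# (hex EF BB BF, written here as octal 357 273 277 for portability).
printf '\357\273\277hello\n' > bom_sample.txt

# Check the first three bytes; a UTF-8 BOM shows up as "efbbbf".
first_bytes=$(head -c 3 bom_sample.txt | od -An -tx1 | tr -d ' \n')
echo "first bytes: $first_bytes"

# Strip the BOM by copying everything from byte 4 onward.
# Only do this after confirming the BOM is there; otherwise real data is lost.
if [ "$first_bytes" = "efbbbf" ]; then
    tail -c +4 bom_sample.txt > nobom_sample.txt
fi
```

Unlike the awk and sed variants, tail does not look at the content at all, which is why the check comes first.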
In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:
dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:
Using GNU sed (on Linux or Cygwin):
On FreeBSD:
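For the GNU sed case, a minimal sketch might look like the following. It assumes GNU sed (for both -i without an argument and \x escapes); the file name is hypothetical, and LC_ALL=C is my own addition so the pattern is matched byte by byte regardless of locale:

```shell
# Sample file: UTF-8 BOM (octal 357 273 277 = hex EF BB BF) plus one line of text.
printf '\357\273\277hello\n' > sed_sample.txt

# GNU sed: on line 1 only, delete EF BB BF if it sits at the start of the line,
# and write the result back to the same file (-i).
LC_ALL=C sed -i '1s/^\xEF\xBB\xBF//' sed_sample.txt

# BSD-derived seds expect a backup suffix argument after -i (for example -i .bak),
# so the command above is not portable as written.

# Dump the result; the leading ef bb bf should be gone.
od -An -tx1 sed_sample.txt
```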
Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.
On Mac:
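Without a usable sed -i, one option is to fall back on awk and a temporary file. This is only a sketch: the file name is hypothetical, mktemp is assumed to be available, and the BOM is spelled in octal (\357\273\277 is the same three bytes as \xef\xbb\xbf) because octal escapes are the form POSIX awk defines:

```shell
# Sample file with a UTF-8 BOM in front of the data.
printf '\357\273\277data\n' > orig_sample.txt

# Write the BOM-free copy to a temporary file, then replace the original.
# The && chain leaves the original untouched if awk fails.
# Note: the replacement file gets the temporary file's permissions.
tmp=$(mktemp) &&
LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") } 1' orig_sample.txt > "$tmp" &&
mv "$tmp" orig_sample.txt
```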
This awk solution in another answer works, but the sed command above does not work. At least on Mac (Sierra), the sed documentation does not mention supporting hexadecimal escaping à la \xef.
A similar trick can be achieved with any program by piping to the sponge tool from moreutils:
Try this:
On the first record (line), remove the BOM characters. Print every record.
Or slightly shorter, using the knowledge that the default action in awk is to print the record:
1 is the shortest condition that always evaluates to true, so each record is printed.
Enjoy!
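The two forms described above can be compared side by side. This is a sketch with made-up file names; it uses the octal spelling \357\273\277 of the BOM bytes and LC_ALL=C, both my own choices to keep it working across awk implementations:

```shell
# Sample input: a UTF-8 BOM (octal 357 273 277 = hex EF BB BF) followed by two lines.
printf '\357\273\277first\nsecond\n' > with_bom.txt

# Long form: strip the BOM on record 1, then print every record explicitly.
LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") } { print }' with_bom.txt > long_form.txt

# Short form: the bare pattern 1 is always true, and the default action prints.
LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") } 1' with_bom.txt > short_form.txt

# cmp is silent and exits 0 when the two outputs are byte-for-byte identical.
cmp long_form.txt short_form.txt
```

Because the ^ anchor makes sub() a no-op when no BOM is present, both forms are safe to run on files that never had one.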
-- ADDENDUM --
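The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:
Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes from the above table.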
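The same byte sequences can drive a small detector. This is only a sketch: detect_bom is a hypothetical helper name, and the labels it prints are my own shorthand for the encoding forms:

```shell
# detect_bom FILE: print the encoding implied by the file's leading bytes.
detect_bom() {
    # Read up to four bytes as lowercase hex with no separators.
    sig=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$sig" in
        0000feff) echo "UTF-32BE" ;;   # 00 00 FE FF
        fffe0000) echo "UTF-32LE" ;;   # FF FE 00 00; checked before UTF-16LE,
                                       # which shares its first two bytes
        efbbbf*)  echo "UTF-8" ;;      # EF BB BF
        feff*)    echo "UTF-16BE" ;;   # FE FF
        fffe*)    echo "UTF-16LE" ;;   # FF FE
        *)        echo "none" ;;       # no recognised BOM
    esac
}

# Example: a UTF-8 file with a BOM, and a plain file without one.
printf '\357\273\277hi\n' > detect_utf8.txt
printf 'plain\n' > detect_plain.txt
detect_bom detect_utf8.txt
detect_bom detect_plain.txt
```

One caveat: FF FE 00 00 is also what a UTF-16LE file starting with a NUL character looks like, so the UTF-32LE answer is a best guess rather than a certainty.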