When I use iconv to convert from UTF-16 to UTF-8 everything is fine, but the other way around it does not work. I have these files:
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The text looks OK in an editor. When I run this:
iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings
Then I get this result:
b-16.strings: data
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The `file` utility does not show the expected file format, and the text does not look right in an editor either. Could it be that iconv does not create a proper BOM? I am running it on the macOS command line.

Why is b-16.strings not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?
More elaboration is below.
$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings
$ file *s
a-16.strings: Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings: UTF-8 Unicode c program text, with very long lines
b-16be.strings: Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings: data
$ od -c a-16.strings | head
0000000 377 376 / \0 * \0 * \0 * \0 \f 001 E \0 S \0
$ od -c a-8.strings | head
0000000 / * * * Č ** E S K Y ( J V O
$ od -c b-16be.strings | head
0000000 376 377 \0 / \0 * \0 * \0 * 001 \f \0 E \0 S
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000 / \0 * \0 * \0 * \0 \f 001 E \0 S \0 K \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000 / \0 * \0 * \0 * \0 \f 001 E \0 S \0 K \0
It is clear that the BOM is missing whenever I convert to UTF-16LE. Any help on this?
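For reference, the same pattern can be reproduced with any small throwaway file (`sample.txt` here is an arbitrary stand-in, not one of the files above): `-t UTF-16LE` emits no BOM, while `-t UTF-16` does.

```shell
# Arbitrary stand-in input; any UTF-8 text shows the same behavior.
printf 'hi\n' > sample.txt

# -t UTF-16LE: no BOM is emitted, just the raw little-endian code units
iconv -f UTF-8 -t UTF-16LE sample.txt | od -An -tx1

# -t UTF-16: a BOM is emitted (ff fe or fe ff, depending on the platform)
iconv -f UTF-8 -t UTF-16 sample.txt | od -An -tx1
```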
I first convert to `UTF-16`, which will prepend a byte-order mark if necessary, as Keith Thompson mentions. Then, since `UTF-16` doesn't define endianness, we must use `file` to determine whether it's `UTF-16BE` or `UTF-16LE`. Finally, we can convert to `UTF-16LE`.
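The code block that accompanied this answer did not survive the copy; the following is a sketch of the three steps it describes, using the question's filenames (the temporary-file handling and error branch are my additions). The key detail is that naming the source encoding explicitly in the last step makes iconv treat the BOM as an ordinary U+FEFF character, so it survives instead of being stripped.

```shell
#!/bin/sh
# Stand-in input for the sketch; the question's real file is a-8.strings.
printf 'hello\n' > a-8.strings

# Step 1: convert to UTF-16, which prepends a BOM.
iconv -f UTF-8 -t UTF-16 a-8.strings > tmp.strings

# Step 2: ask file(1) which byte order iconv chose.
case $(file tmp.strings) in
  *[Ll]ittle-endian*)
      # Already UTF-16LE with a BOM -- done.
      mv tmp.strings b-16.strings ;;
  *[Bb]ig-endian*)
      # Step 3: convert to UTF-16LE. With an explicit source encoding,
      # the BOM is byte-swapped along with the rest instead of consumed.
      iconv -f UTF-16BE -t UTF-16LE tmp.strings > b-16.strings
      rm tmp.strings ;;
  *)  echo "unexpected file(1) output" >&2; exit 1 ;;
esac
```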
This may not be an elegant solution, but I found a manual way to ensure correct conversion for my problem, which I believe is similar to the subject of this thread.
The Problem: I got a text data file from a user and I was going to process it on Linux (specifically, Ubuntu) using a shell script (tokenization, splitting, etc.). Let's call the file `myfile.txt`. The first indication that something was amiss was that the tokenization was not working, so I was not surprised when I ran the `file` command on `myfile.txt` and it did not report the plain ASCII text I expected.

The Solution: To make the data file compliant, below are the 3 manual steps that I found to work after some trial and error with other steps.
1. First convert to Big Endian at the same encoding via `vi` (or `vim`): open the file with `vi myfile.txt` and, in `vi`, do `:set fileencoding=UTF-16BE`, then write out the file. You may have to force the write with `:wq!`.
2. Open `vi myfile.txt` again (the file should now be in UTF-16BE) and, in `vi`, do `:set fileencoding=ASCII`, then write out the file. Again, you may have to force the write with `:wq!`.
3. Run the `dos2unix` converter: `d2u myfile.txt`. If you now run `file myfile.txt`, you should see something more familiar and reassuring.

That's it. That's what worked for me, and I was then able to run my processing bash shell script on `myfile.txt`. I found that I cannot skip Step 2; that is, in this case I cannot skip directly to Step 3. Hopefully you can find this info useful; hopefully someone can automate it, perhaps via `sed` or the like. Cheers.
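For what it's worth, the three steps above can be approximated non-interactively. This is my own substitution, swapping `vi` and `d2u` for `iconv` and `tr`, and it assumes the input is UTF-16 with a BOM, ASCII-only content, and CRLF line endings:

```shell
# Fabricate a stand-in myfile.txt: UTF-16 with a BOM and CRLF line endings.
printf 'some data\r\n' | iconv -f UTF-8 -t UTF-16 > myfile.txt

# Steps 1-2 (re-encode down to plain ASCII) via iconv; step 3 (d2u) via tr.
iconv -f UTF-16 -t ASCII myfile.txt | tr -d '\r' > myfile.clean
mv myfile.clean myfile.txt

file myfile.txt   # should now report plain ASCII text
```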
`UTF-16LE` tells `iconv` to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified `LE`, the BOM isn't necessary. `UTF-16` tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.

If you're on a little-endian machine, I don't see a way to tell
`iconv` to generate big-endian UTF-16 with a BOM, but I might just be missing something.

I find that the `file` command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run `iconv -f UTF-16LE -t UTF-8 b-16.strings`, you should get a valid UTF-8 version of the original file.

Try running `od -c` on the files to see their actual contents.

UPDATE:
It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, `iconv` won't do that directly, but this should work: emit the BOM bytes yourself with `printf`, then append the BOM-less `iconv` output. The behavior of the
`printf` might depend on your locale settings; I have `LANG=en_US.UTF-8`.

(Can anyone suggest a more elegant solution?)
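The command that went with this update was lost in the copy; the following sketch shows the printf-plus-iconv idea, using octal escapes (which are POSIX-portable, unlike `\xff`) and the question's filenames:

```shell
# Stand-in input; the question's real file is a-8.strings.
printf 'hello\n' > a-8.strings

# Write a little-endian BOM (FF FE) by hand, then append the BOM-less
# UTF-16LE output of iconv.
{ printf '\377\376'; iconv -f UTF-8 -t UTF-16LE a-8.strings; } > b-16.strings

file b-16.strings   # should now be identified as little-endian UTF-16 text
```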
Another workaround, if you know the endianness of the output produced by `-t utf-16`:
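The code for this workaround was also lost; one reconstruction of the technique (my guess, not necessarily the original command) byte-swaps the known-endianness output with `dd conv=swab`, which flips every byte pair, so big-endian UTF-16 becomes little-endian with the BOM flipped along with everything else:

```shell
# Build a stand-in for the question's b-16be.strings: big-endian, with BOM.
{ printf '\376\377'; printf 'hello\n' | iconv -f UTF-8 -t UTF-16BE; } > b-16be.strings

# Swap every byte pair: big-endian UTF-16 becomes little-endian UTF-16,
# and the BOM FE FF becomes FF FE like every other 16-bit unit.
dd conv=swab < b-16be.strings > b-16le.strings 2>/dev/null
```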