Convert UTF-8 to UTF-16 using iconv

Asked 2019-03-11 23:53

When I use iconv to convert from UTF-16 to UTF-8, everything is fine, but the other way around it does not work. I have these files:

a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The text looks OK in an editor. When I run this:

iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings

Then I get this result:

b-16.strings:    data
a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The file utility does not show the expected file format, and the text does not look right in an editor either. Could it be that iconv does not create a proper BOM? I am running it on the macOS command line.

Why is b-16.strings not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?

More elaboration is below.

$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings 
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings

$ file *s
a-16.strings:                   Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings:                    UTF-8 Unicode c program text, with very long lines
b-16be.strings:                 Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings:    data


$ od -c a-16.strings | head
0000000  377 376   /  \0   *  \0      \0  \f 001   E  \0   S  \0   K  \0

$ od -c a-8.strings | head 
0000000    /   *   *   *       Č  **   E   S   K   Y       (   J   V   O

$ od -c b-16be.strings | head
0000000  376 377  \0   /  \0   *  \0   *  \0   *  \0     001  \f  \0   E

$ od -c b-16le-BAD-fromUTF16BE.strings | head                                
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0

$ od -c b-16le-BAD-fromUTF8.strings | head
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0
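A quick way to confirm the missing BOM is to dump just the first two bytes of each file; ff fe is the UTF-16LE BOM:

head -c 2 a-16.strings | od -An -tx1                  # ff fe (BOM present)
head -c 2 b-16le-BAD-fromUTF8.strings | od -An -tx1   # 2f 00 ("/", no BOM)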

So it is clear that the BOM is missing whenever I convert to UTF-16LE. Any help on this?

3 Answers

再贱就再见 · 2019-03-12 00:00

I first convert to UTF-16, which will prepend a byte-order mark if necessary, as Keith Thompson mentions. Then, since UTF-16 doesn't define endianness, we must use file to determine whether the result is UTF-16BE or UTF-16LE. Finally, we can convert to UTF-16LE.

iconv -f utf-8 -t utf-16 UTF-8-FILE > UTF-16-UNKNOWN-ENDIANNESS-FILE
FILE_ENCODING="$( file --brief --mime-encoding UTF-16-UNKNOWN-ENDIANNESS-FILE )"
iconv -f "$FILE_ENCODING" -t UTF-16LE UTF-16-UNKNOWN-ENDIANNESS-FILE > UTF-16-FILE
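A minimal sketch wrapping the three steps into a reusable shell function; the function name to_utf16le and the temp-file handling are my own additions, not part of the original recipe:

to_utf16le() {
    local in="$1" out="$2" tmp enc
    tmp="$(mktemp)"
    iconv -f utf-8 -t utf-16 "$in" > "$tmp"        # BOM added; endianness unknown
    enc="$(file --brief --mime-encoding "$tmp")"   # e.g. utf-16be or utf-16le
    iconv -f "$enc" -t UTF-16LE "$tmp" > "$out"    # normalize to LE; BOM is preserved
    rm -f "$tmp"
}

to_utf16le a-8.strings b-16.strings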
聊天终结者 · 2019-03-12 00:09

This may not be an elegant solution, but I found a manual way to ensure a correct conversion for my problem, which I believe is similar to the subject of this thread.

The Problem: I got a text data file from a user and was going to process it on Linux (specifically Ubuntu) with a shell script (tokenization, splitting, etc.). Let's call the file myfile.txt. The first indication that something was amiss was that the tokenization was not working, so I was not surprised when I ran the file command on myfile.txt and got the following:

$ file myfile.txt

myfile.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

If the file had been compliant, this is what the output should have been:

$ file myfile.txt

myfile.txt: ASCII text, with very long lines

The Solution: To make the data file compliant, below are the three manual steps that I found to work after some trial and error with other approaches.

  1. First convert to big-endian in the same encoding via vi (or vim): vi myfile.txt. In vi, do :set fileencoding=UTF-16BE, then write the file out. You may have to force the write with :wq!.

  2. vi myfile.txt (which should now be in UTF-16BE). In vi, do :set fileencoding=ASCII, then write the file out. Again, you may have to force the write with :wq!.

  3. Run the dos2unix converter: d2u myfile.txt. If you now run file myfile.txt, you should see output that looks more familiar and reassuring, like:

    myfile.txt: ASCII text, with very long lines
    

That's it. That's what worked for me, and I was then able to run my bash shell script to process myfile.txt. I found that I could not skip Step 2; that is, I could not jump directly to Step 3. Hopefully you find this info useful, and hopefully someone can automate it, perhaps via sed or the like; a sketch of one possible automation is below. Cheers.
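For what it's worth, a non-interactive sketch of the same idea using iconv and dos2unix instead of vi. It assumes the input really is UTF-16 with a BOM and CRLF line endings, as file reported; //TRANSLIT is a GNU iconv extension that approximates characters with no ASCII equivalent, so results may differ on other iconv implementations:

iconv -f UTF-16 -t ASCII//TRANSLIT myfile.txt > myfile.tmp && mv myfile.tmp myfile.txt
dos2unix myfile.txt    # or: d2u myfile.txt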

Emotional °昔 · 2019-03-12 00:15

UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.

UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
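A quick way to see the difference for yourself (the byte order that plain UTF-16 produces depends on your machine; on the asker's machine it came out big-endian):

printf 'x' | iconv -f UTF-8 -t UTF-16   | od -An -tx1   # BOM first: fe ff or ff fe
printf 'x' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1   # no BOM: just 78 00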

If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.

I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.

Try running od -c on the files to see their actual contents.

UPDATE:

It looks like you're on a big-endian machine, since -t UTF-16 produced big-endian output (x86, by contrast, is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:

( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE

The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.

(Can anyone suggest a more elegant solution?)

Another workaround, if you know the endianness of the output produced by -t utf-16:

iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
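Either way, you can check the result: the output should now begin with the ff fe BOM, and file should recognize it:

od -An -tx1 UTF-16-FILE | head -n 1   # expect it to start with: ff fe
file UTF-16-FILE                      # expect: Little-endian UTF-16 Unicode text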