UTF-16 is a character encoding built from two-byte (16-bit) code units. Swapping the two bytes of each code unit gives either UTF-16BE or UTF-16LE.
But I see that the Ubuntu gedit text editor offers an encoding named plain UTF-16, as well as UTF-16BE and UTF-16LE. With a C test program I found that my computer is little endian, and I confirmed that UTF-16 there produces the same bytes as UTF-16LE.
Also: a value such as an integer can be stored in one of two byte orders, depending on whether the computer is little or big endian. Little endian computers produce little endian values in hardware (except values written by Java, which are always big endian).
Since text can be saved as UTF-16LE as well as UTF-16BE on my little endian computer, are characters produced one byte at a time (like an ASCII string, see reference [3]), with the endianness of UTF-16 simply defined by convention, rather than resulting from big endian machines writing big endian UTF-16 and little endian machines writing little endian UTF-16?
- http://www.ibm.com/developerworks/aix/library/au-endianc/
- http://teaching.idallen.com/cst8281/10w/notes/110_byte_order_endian.html
- ASCII strings and endianness
- Is it true that endianness only affects the memory layout of numbers, but not strings? (A post about the relation between the endianness of strings and the machine.)
"is endian of UTF-16 the computer's endianness?"
The impact of your computer's endianness can be looked at from the point of view of a writer or a reader of a file.
If you are reading a file in a -standard- format, then the kind of machine reading it shouldn't matter. The format should be well-defined enough that no matter what the endianness of the reading machine is, the data can still be read correctly.
That doesn't mean the format can't be flexible. With "UTF-16" (when a "BE" or "LE" disambiguation is not used in the format name) the definition allows files to be marked as either big endian or little endian. This is done with something called the "Byte Order Mark" (BOM) in the first two bytes of the file:
https://en.wikipedia.org/wiki/Byte_order_mark
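As a rough illustration of what a reader does with those first two bytes, here is a minimal Python sketch. The function name `sniff_utf16_bom` is made up for this example; the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants are from the standard library.

```python
import codecs


def sniff_utf16_bom(data: bytes) -> str:
    """Return 'big', 'little', or 'none' based on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE 0xFF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF 0xFE
        return "little"
    return "none"


print(sniff_utf16_bom(b"\xfe\xff\x00A"))  # big
print(sniff_utf16_bom(b"\xff\xfeA\x00"))  # little
print(sniff_utf16_bom(b"\x00A"))          # none
```

Note that this only looks at the marker; it says nothing about whether the rest of the data is actually valid UTF-16.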
The existence of the BOM gives options to the writer of a file. They might choose to write out the most natural endianness for a buffer in memory, and include a BOM that matches. This wouldn't necessarily be the most efficient format for some other reader. But any program claiming UTF-16 support is supposed to be able to handle it either way.
So yes--the computer's endianness might factor into the endianness choice of a BOM-marked UTF-16 file. Still...a little-endian program is fully able to save a file, label it "UTF-16" and have it be big-endian. As long as the BOM is consistent with the data, it doesn't matter what kind of machine writes or reads it.
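To make that concrete, here is a small Python sketch (not tied to any particular editor) that writes big-endian data with a matching BOM regardless of the host's byte order, then reads it back through the generic "utf-16" codec. The sample text "hi" is arbitrary.

```python
import codecs
import sys

text = "hi"

# Big-endian data with a matching BOM, whatever the host's byte order is.
data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(sys.byteorder)          # 'little' or 'big', depending on the machine
print(data)                   # b'\xfe\xff\x00h\x00i'
print(data.decode("utf-16"))  # hi
```

The decoder picks the byte order from the BOM, so the result is the same on either kind of machine.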
...what if there's no BOM?
This is where things get a little hazy.
On the one hand, RFC 2781 and the Unicode FAQ are clear. They say that a file in "UTF-16" format which starts with neither 0xFF 0xFE nor 0xFE 0xFF is to be interpreted as big endian.
Yet to know if you have a UTF-16LE, UTF-16BE, or UTF-16 file with no BOM...you need metadata outside the file telling you which of the three it is. Because there's not always a place to put that data, some programs wound up using heuristics.
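The "no BOM means big endian" rule can be sketched in Python as below. The helper name `decode_utf16` is hypothetical, and this is only one way to follow the rule, not a reference implementation. It is spelled out by hand because, as far as I know, Python's built-in "utf-16" codec falls back to the host's byte order when no BOM is present, which is not the same thing.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode 'UTF-16' bytes: honor a BOM if present, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: big-endian by default, independent of the host machine.
    return data.decode("utf-16-be")


print(decode_utf16(b"\xff\xfeA\x00"))  # A
print(decode_utf16(b"\xfe\xff\x00A"))  # A
print(decode_utf16(b"\x00A"))          # A
```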
Consider something like this from Raymond Chen (2007):
That's a valid UTF-16LE file, but where would the "UTF-16LE" meta-label be stored? What are the odds someone passes that off by just calling it a UTF-16 file?
Empirically there are warnings about the term. The Wikipedia page for UTF-16 says:
And unicode.readthedocs.org says:
And further, the Byte-Order-Mark Wikipedia article says:
So despite the unambiguity of the standard, the context may matter in practice.
As @rici points out, the standard has been around for a while now. Still, it may pay to double-check files claimed as "UTF-16". Or even to consider whether you might want to avoid a lot of these issues and embrace UTF-8...
"Should UTF-16 be considered harmful?"
The Unicode encoding schemes are defined in section 3.10 of the Unicode standard. The standard defines seven encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
In the case of the 16- and 32-bit encodings, the three variants differ in endianness, which may be explicit or indicated by starting the string with a Byte Order Mark (BOM) character, U+FEFF:
- The LE variant is definitely little-endian; the low-order byte is encoded first. No BOM is permitted, so an initial character U+FEFF is a zero-width no-break space.
- The BE variant is definitely big-endian; the high-order byte is encoded first. As with the LE variant, no BOM is permitted, so an initial character U+FEFF is a zero-width no-break space.
- The unmarked variant takes its byte order from an initial BOM if there is one; without a BOM it is read as big-endian.
If you are going to use 16- or 32-bit encoding schemes for data serialization, it is generally recommended to use the unmarked variants with an explicit BOM. However, UTF-8 is a much more common data interchange format.
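The difference between the explicit LE variant and the unmarked variant shows up when the data happens to start with U+FEFF. A short Python sketch, using the standard "utf-16-le" and "utf-16" codecs on a hand-written byte string:

```python
raw = b"\xff\xfeA\x00"  # bytes FF FE, then 'A' in little-endian order

as_le = raw.decode("utf-16-le")    # explicit LE: U+FEFF is kept as a character
as_generic = raw.decode("utf-16")  # unmarked: U+FEFF is consumed as the BOM

print(repr(as_le))       # '\ufeffA'
print(repr(as_generic))  # 'A'
```

The same bytes give a two-character string under the LE label and a one-character string under the unmarked label.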
Although no endian marker is needed for UTF-8, it is permitted (but not recommended) to start a UTF-8 encoded string with a BOM; this can be used to differentiate between Unicode encoding schemes. Many Windows programs do this, and a U+FEFF at the beginning of a UTF-8 transmission should probably be treated as a BOM (and thus not as Unicode data).
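For the UTF-8 case, a reader can strip a leading BOM explicitly. In Python this is what the standard "utf-8-sig" codec does; the sketch below contrasts it with plain "utf-8" on an arbitrary sample string.

```python
import codecs

data = codecs.BOM_UTF8 + "hi".encode("utf-8")  # EF BB BF, then the text

print(repr(data.decode("utf-8")))      # '\ufeffhi'
print(repr(data.decode("utf-8-sig")))  # 'hi'
```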
No. Little endian computers receive big endian packets from the internet all the time.
The encoding depends on the order in which you write the bytes, not on your machine's architecture.
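A small Python sketch of that point: the writer picks the byte order explicitly, and the result does not depend on `sys.byteorder`. The character "A" is just an example value.

```python
import struct
import sys

code_unit = ord("A")  # 0x0041

big = struct.pack(">H", code_unit)     # high-order byte first
little = struct.pack("<H", code_unit)  # low-order byte first

print(sys.byteorder)                      # host order; does not affect the lines below
print(big == "A".encode("utf-16-be"))     # True
print(little == "A".encode("utf-16-le"))  # True
```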