Why is it that UTF-8 encoding is used when interacting with a Unix/Linux environment?

Posted 2020-02-05 15:59

Question:

I know it is customary, but why? Are there real technical reasons why any other way would be a really bad idea or is it just based on the history of encoding and backwards compatibility? In addition, what are the dangers of not using UTF-8, but some other encoding (most notably, UTF-16)?

Edit : By interacting, I mostly mean the shell and libc.

Answer 1:

Partly because file systems (and the system calls beneath them) expect NUL ('\0') bytes to terminate file names, while UTF-16 text routinely contains zero bytes, so it would not work well. You'd have to modify a lot of code to make that change.



Answer 2:

As Jonathan Leffler mentions, the prime issue is the ASCII null character. C traditionally expects a string to be null terminated, so standard C string functions will choke on any UTF-16 character containing a byte equivalent to an ASCII NUL (0x00). While you can certainly program with wide-character support, UTF-16 is not suitable as an external encoding of Unicode for filenames, text files, or environment variables.
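To see the problem concretely, here is a minimal C sketch (the byte arrays are hand-written illustrations of the two encodings of the ASCII string "Hi"): strlen() stops at the first 0x00 byte, which in UTF-16LE shows up right after the first character.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* "Hi" encoded as UTF-16LE: each ASCII code unit gets a 0x00 high byte */
    const char utf16le[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };
    /* "Hi" encoded as UTF-8: identical to plain ASCII, no embedded zero bytes */
    const char utf8[]    = { 'H', 'i', 0x00 };

    /* strlen() treats the first 0x00 byte as the string terminator */
    printf("strlen(UTF-16LE \"Hi\") = %zu\n", strlen(utf16le)); /* prints 1 */
    printf("strlen(UTF-8 \"Hi\")    = %zu\n", strlen(utf8));    /* prints 2 */
    return 0;
}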

Furthermore, UTF-16 and UTF-32 both come in big-endian and little-endian byte orders. To deal with this, you'll need either external metadata, such as a MIME type, or a byte order mark (BOM). And a BOM brings its own problems:

Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts.
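For illustration, here is a small C sketch (the byte values are written out by hand) of why byte order matters for UTF-16 but not for UTF-8, using the Euro sign U+20AC and the BOM character U+FEFF:

#include <stdio.h>

static void dump(const char *label, const unsigned char *bytes, size_t n) {
    printf("%-10s", label);
    for (size_t i = 0; i < n; i++) printf(" %02X", bytes[i]);
    printf("\n");
}

int main(void) {
    /* U+20AC (Euro sign) in three encodings */
    const unsigned char utf16be[] = { 0x20, 0xAC };       /* big endian        */
    const unsigned char utf16le[] = { 0xAC, 0x20 };       /* little endian     */
    const unsigned char utf8[]    = { 0xE2, 0x82, 0xAC }; /* order independent */
    /* U+FEFF, the byte order mark, in both UTF-16 byte orders */
    const unsigned char bom_be[]  = { 0xFE, 0xFF };
    const unsigned char bom_le[]  = { 0xFF, 0xFE };

    dump("UTF-16BE:", utf16be, sizeof utf16be);
    dump("UTF-16LE:", utf16le, sizeof utf16le);
    dump("UTF-8:",    utf8,    sizeof utf8);
    dump("BOM (BE):", bom_be,  sizeof bom_be);
    dump("BOM (LE):", bom_le,  sizeof bom_le);
    return 0;
}

A decoder that guesses the wrong byte order reads 0xAC20 instead of 0x20AC, which is exactly why external metadata or a BOM is needed; UTF-8 has a single, fixed byte sequence per code point.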

The predecessor to UTF-16, which was called UCS-2 and didn't support surrogate pairs, had the same issues. UCS-2 should be avoided.



Answer 3:

I believe it's mainly the backwards compatibility with ASCII that UTF-8 gives.

For an answer to the 'dangers' question, you need to specify what you mean by 'interacting'. Do you mean interacting with the shell, with libc, or with the kernel proper?



Answer 4:

Modern Unixes use UTF-8, but this was not always true. On RHEL2 -- which is only a few years old -- the default is

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

The C/POSIX locale is expected to use a 7-bit, ASCII-compatible encoding.
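If you want to check which encoding the current locale actually selects, a short POSIX C sketch like this works (under LANG=C, glibc typically reports ANSI_X3.4-1968, i.e. plain ASCII; under something like en_US.UTF-8 it reports UTF-8):

#include <stdio.h>
#include <locale.h>
#include <langinfo.h>   /* POSIX: nl_langinfo() */

int main(void) {
    /* Adopt the locale named by the environment (LANG / LC_* variables) */
    setlocale(LC_ALL, "");
    /* Ask which character encoding that locale uses */
    printf("Locale codeset: %s\n", nl_langinfo(CODESET));
    return 0;
}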

However, as Jonathan Leffler stated, any encoding which allows for NUL bytes within a character sequence is unworkable on Unix, as system APIs are locale-ignorant; strings are all assumed to be byte sequences terminated by \0.
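As a rough illustration of that point at the system-call level (the file name here is just an example): the kernel reads the path argument only up to the first zero byte, so a UTF-16LE-encoded name gets silently truncated.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* "AB" encoded as UTF-16LE: 'A', 0x00, 'B', 0x00, plus terminator */
    const char name_utf16le[] = { 'A', 0x00, 'B', 0x00, 0x00, 0x00 };

    /* The kernel stops reading the path at the first 0x00 byte,
       so this creates a file literally named "A", not "AB". */
    int fd = open(name_utf16le, O_CREAT | O_WRONLY, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    printf("Created a file whose name is just \"A\"\n");
    close(fd);
    return 0;
}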



Answer 5:

I believe that when Microsoft started using a two-byte encoding, no characters above 0xFFFF had been assigned, so using a two-byte encoding meant that no one had to worry about characters being different lengths.

Now that there are characters outside this range, you'll have to deal with characters of different lengths anyway, so why would anyone use UTF-16? I suspect Microsoft would make a different decision if they were designing their Unicode support today.



Answer 6:

Yes, it's for compatibility reasons. UTF-8 is backwards compatible with ASCII. Linux/Unix were ASCII-based, so it just made (and still makes) sense.



Answer 7:

I thought 7-bit ASCII was fine.

Seriously, Unicode is relatively new in the scheme of things, and UTF-8 is backward compatible with ASCII and uses less space (about half, for mostly-ASCII files), since it uses 1 to 4 bytes per code point (character), while UTF-16 uses either 2 or 4 bytes per code point.
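A quick C11 sketch (the sample code points are my own choice) that prints the per-code-point sizes; sizeof on the literals includes the terminating code unit, so it is subtracted:

#include <stdio.h>
#include <uchar.h>   /* char16_t */

int main(void) {
    printf("'A'    U+0041:  UTF-8 %zu byte(s), UTF-16 %zu byte(s)\n",
           sizeof(u8"A") - 1,          sizeof(u"A") - sizeof(char16_t));
    printf("Euro   U+20AC:  UTF-8 %zu byte(s), UTF-16 %zu byte(s)\n",
           sizeof(u8"\u20AC") - 1,     sizeof(u"\u20AC") - sizeof(char16_t));
    printf("Emoji U+1F600:  UTF-8 %zu byte(s), UTF-16 %zu byte(s)\n",
           sizeof(u8"\U0001F600") - 1, sizeof(u"\U0001F600") - sizeof(char16_t));
    /* Typical output: 'A' 1 vs 2, Euro 3 vs 2, Emoji 4 vs 4 bytes */
    return 0;
}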

UTF-16 is sometimes preferred for internal program usage because its widths are simpler to handle, although it is still variable width; its predecessor UCS-2 really was exactly 2 bytes for every code point.



Answer 8:

I think it's because programs that expect ASCII input won't be able to handle encodings such as UTF-16. For most characters (those in the 0-255 range), such programs would see one of the two bytes of each UTF-16 code unit as a NUL / 0 character, which many languages and systems use to mark the end of a string. This doesn't happen in UTF-8, which was designed to avoid embedded NULs and to be byte-order agnostic.
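To make that last property visible, here is a small C sketch (the sample string "café/α" is my own): every byte belonging to a multi-byte UTF-8 sequence is 0x80 or above, so a 0x00 byte (or a '/' path separator) can never occur inside a character.

#include <stdio.h>

int main(void) {
    /* "café/α" as UTF-8 bytes: ASCII bytes stay ASCII, while the
       non-ASCII characters use only bytes in the 0x80-0xFF range */
    const unsigned char s[] = "caf\xC3\xA9/\xCE\xB1";

    for (const unsigned char *p = s; *p; p++)
        printf("%02X %s\n", *p, *p < 0x80 ? "(ASCII)" : "(part of multi-byte char)");
    return 0;
}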