Why does anyone use an encoding other than UTF-8?

Posted 2020-05-29 13:53

Question:

I want to know why any developer would need to use an encoding other than UTF-8.

Answer 1:

Wikipedia lists advantages and disadvantages of UTF-8 as compared to a variety of other encodings:

http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages

The most important disadvantages, in my opinion, are that UTF-8 can use significantly more space for Asian languages such as Chinese, Japanese, or Hindi, and that not all code points have the same size, which makes length measurements harder and many string operations, such as indexing by character position, inefficient.
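To make the size difference concrete, here is a quick Python sketch (the sample strings are just illustrative) comparing the encoded size of a short Chinese string and an English one:

```python
# Encoded size of the same short texts in UTF-8 vs UTF-16.
cjk = "统一码"                        # "Unicode" in Chinese, 3 characters
assert len(cjk.encode("utf-8")) == 9        # 3 bytes per character
assert len(cjk.encode("utf-16-le")) == 6    # 2 bytes per character

eng = "Unicode"                      # 7 ASCII characters
assert len(eng.encode("utf-8")) == 7        # 1 byte per character
assert len(eng.encode("utf-16-le")) == 14   # 2 bytes per character
```

So for CJK-heavy text UTF-8 costs about 50% more than UTF-16, while for ASCII-heavy text the situation is reversed.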



Answer 2:

Well, some do it because their tools are archaic or flawed. Some do it because they don't see a need to support anything other than ASCII. Some do it because they don't know any better.

Those are the usual excuses for not using Unicode.

As for not using UTF-8 specifically, there are different reasons. Some systems, like Windows1 (and, stemming from that, .NET) and Java, came to be at a time when Unicode was a strict 16-bit code. Therefore, there was really only one encoding: UCS-2, which encodes code points directly as 16-bit words.

Later, Unicode was expanded to 21 bits because 65,536 code points weren't enough anymore. This caused encodings such as UTF-32 and UTF-16 to appear. For systems previously working with UCS-2, the transition to UTF-16 was the easiest and most sensible choice. Windows made that transition back in Ye Olde Days of Windows 2000.
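The UCS-2-to-UTF-16 transition works because code points beyond U+FFFF are split into a surrogate pair of two 16-bit units. A quick Python sketch of the arithmetic (the example code point is arbitrary):

```python
# Encode a code point beyond the BMP as a UTF-16 surrogate pair.
cp = 0x1F600                     # an emoji, outside the old 16-bit range
v = cp - 0x10000                 # 20-bit remainder to split in two
high = 0xD800 + (v >> 10)        # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate
assert (high, low) == (0xD83D, 0xDE00)

# Cross-check against Python's own UTF-16 codec:
assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
```

Everything at or below U+FFFF keeps its old UCS-2 representation, which is exactly why the upgrade was painless for Windows and Java.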

So while I think that nearly all applications nowadays should support Unicode, I don't think it is strictly necessary for them to use UTF-8 specifically. There are historic reasons for that, and no real benefit in converting existing systems from UTF-16 to UTF-8.


1 NT.



Answer 3:

Code points between U+0800 and U+FFFF take up three bytes in UTF-8 but only two in UTF-16. See the Wikipedia comparison for more details, but basically, if text heavily uses code points in this range (say, if it's Chinese), UTF-8 files will be larger than UTF-16 files with the same content.
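The boundary is easy to check in Python (the characters here are just illustrative):

```python
# U+0080..U+07FF take 2 bytes in UTF-8; U+0800..U+FFFF take 3.
assert len("\u07ff".encode("utf-8")) == 2   # last 2-byte code point
assert len("\u0800".encode("utf-8")) == 3   # first 3-byte code point

# A typical CJK character costs 3 bytes in UTF-8 but only 2 in UTF-16:
ch = "中"
assert len(ch.encode("utf-8")) == 3
assert len(ch.encode("utf-16-le")) == 2
```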



Answer 4:

UTF-8 is very efficient at encoding plain English text (the same size as ASCII). If your user base is likely to be mostly, say, Chinese, you may be much better off using UTF-16.

For more information, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.



Answer 5:

Because, outside the English-speaking world, people have for decades been using various encodings that predate Unicode and are tailored to their respective languages. These language-specific encodings have become ingrained everywhere and are effectively standards. If you want to have any hope of interfacing with legacy systems, you have to use them, so all systems have to support them, and they often remain the default even on systems that by now support UTF-8 as well. There may even be multiple legacy encodings traditionally used for different purposes.

Examples:

  • ISO-8859-1 in Western Europe - actually outdated there as well, since you need ISO-8859-15 for the euro sign
  • ISO-2022-JP in Japan for emails, Shift JIS for websites
  • Big5 in Taiwan
  • GB2312 in China

The last two examples show that encodings can even be a political issue.
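The interoperability problem these legacy encodings create is easy to demonstrate. A Python sketch (the sample string is arbitrary) showing that bytes produced under one encoding are garbage under another:

```python
text = "日本語"                      # "Japanese" in Japanese
sjis = text.encode("shift_jis")      # 2 bytes per character here
assert len(sjis) == 6
assert sjis.decode("shift_jis") == text   # round-trips fine

# The same characters have different bytes in UTF-8...
assert sjis != text.encode("utf-8")

# ...and reading the Shift JIS bytes with a wrong encoding is mojibake:
assert sjis.decode("latin-1") != text
```

This is why every system in the chain has to agree on (or at least declare) the encoding in use.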



Answer 6:

Sometimes you are restricted due to historical or tooling reasons (I'm developing on Windows using Zend Studio, on a Samba share on a Linux box, and something in that mix means I keep reverting to Cp1252 instead of UTF-8).

Sometimes you don't need UTF-8 (for example, when storing an MD5 hash in a database: you only need the hexadecimal range 0-9 A-F, which is plain ASCII and encodes to the same bytes in UTF-8 anyway, so declaring the column UTF-8 buys you nothing - and some databases reserve several bytes per character for UTF-8 columns).

Sometimes it's just laziness about learning the UTF-8 functions for a particular language.



Answer 7:

Because they do not know better. The only valid criticism of UTF-8 is that encodings for common Asian languages are larger than in other encodings. UTF-8 is superior because

  • It is ASCII-compatible. Most known and tried string operations do not need adaptation.
  • It is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age. If you have important data in encoding X, spend two minutes on Google and write a conversion function. Even if you have to interface with a sourceless legacy app Z, you can run your communications through a pipe so that your logic stays in the 21st century.
  • UTF-16 isn't fixed-length either, and assuming it is, as many do, will only cause terrible bugs.
  • Additionally, Unicode is very complex, and it is almost certain that any fixed-size algorithm adapted from ASCII will yield bad results, even in UTF-32.

Say you have this UTF-16 string, where [F|3] is a surrogate pair occupying two code units:

[0][1][2][F|3][4][5]

You want to insert a character with code 8 between [3] and [4], i.e. before the fifth character, so you call insert(5, 8).

If you don't scan for characters outside the BMP (serially, as in UTF-8, since you cannot know how many double-sized characters precede the insertion point), you are really counting code units, and position 5 falls inside the surrogate pair. You get:

[0][1][2][F|8][3][4][5]

Two new garbage characters: the mismatched pair [F|8] and the orphaned [3]. So much for your fixed-size encoding. You can of course disallow such characters altogether, but then, when your code interfaces with the real world, you might find your program saves the profile for the user who lives in rm -Rf / into .profile instead of [Classical Chinese Proverb].profile.

Or just an angry user who cannot write his thesis on Classical Chinese Proverbs with your software.
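The failure mode described above can be reproduced directly by manipulating UTF-16 code units as if they were characters (sample string and insertion point chosen to mirror the scenario):

```python
s = "abc\U0001F600de"            # 6 characters; the emoji needs a surrogate pair
units = s.encode("utf-16-le")
assert len(units) // 2 == 7      # 7 UTF-16 code units, not 6

# Naive fixed-width insert "before the 5th character" = before unit 5,
# which lands between the two halves of the surrogate pair:
naive = units[:4 * 2] + "X".encode("utf-16-le") + units[4 * 2:]
broken = naive.decode("utf-16-le", errors="replace")
assert "\ufffd" in broken        # the torn pair decodes as garbage

# Operating on characters (code points) instead gives the right answer:
correct = s[:4] + "X" + s[4:]
assert correct == "abc\U0001F600Xde"
```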



Answer 8:

One legitimate reason is when you need to deal with legacy documents, software or hardware that are not Unicode compatible.

Another legitimate reason is that you need to use a programming language or libraries that do not support UTF-8 / Unicode well ... or at all.

Other answers mention that UTF-16 is more compact than UTF-8 for Asian languages / characters.

And of course there are reasons like short-sightedness, ignorance, laziness ... and deadlines.



Answer 9:

http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/02/cjk-unicode-angst-in-japan-and.html has a good summary + links about the difficulty Japanese users have with Unicode.

http://www.hastingsresearch.com/net/04-unicode-limitations.shtml

Apparently Unicode has been moving away from strict Han unification as a result of such complaints.



Answer 10:

It's also worth remembering that in some circumstances (where a non-Latin set of characters is needed) UTF-8 can actually be larger than the 16-bit Unicode encodings. In those cases UCS-2 or UTF-16 would be a better choice.



Answer 11:

The reasons for using non-Unicode 8-bit character sets / encodings are all backward compatibility of some kind, and/or inertia. For that matter, the most frequent reason for using UTF-8 is compatibility with standards like XML that mandate or prefer it.

Differences in the number of bytes you think text will take up in different encodings, especially in storage, are mostly theoretical. In real world situations, compatibility requirements are more important. If compression is used, the size differences go away anyway. Even if compression is not used, total text size is hard to predict and is rarely a deciding factor.

When converting legacy code that used non-Unicode 8-bit encodings, using UTF-16 can be a tool for making sure all code has been converted, because mismatches can be caught as compile-time type errors. Many languages, runtimes and libraries, like JavaScript, the JVM, .NET and ICU, use 16-bit strings and UTF-16, even though storage and Internet protocols are usually 8-bit.



Answer 12:

Imagine all the files you have to deal with are in GB2312 (the mainland China standard). Then you might choose GB18030 as your Unicode encoding instead. They are compatible in the same way that all of ASCII is compatible with UTF-8. That is useful in mainland China!

You might decide even more quickly when you find out that both of the GB standards mentioned are required by law in IT products (as far as I have heard) if you want to ship in mainland China.

Another upside is that GB2312, and therefore GB18030 as well, are also ASCII-compatible.

It is not so robust algorithmically, though. So if you have no political reasons and no GB2312 legacy, it makes no sense to use it. But if you do, there's your answer.
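The compatibility claims above are easy to verify in Python (the sample strings are arbitrary):

```python
text = "你好"                    # two characters in the GB2312 range

# GB18030 encodes GB2312-range characters with the exact same bytes:
assert text.encode("gb2312") == text.encode("gb18030")

# ...and, like GB2312, it leaves plain ASCII untouched:
assert "plain ASCII 123".encode("gb18030") == b"plain ASCII 123"
```

So existing GB2312 files can be read as GB18030 unchanged, just as existing ASCII files can be read as UTF-8.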



Answer 13:

On a related note: when using MySQL, as if it wasn't complex enough, you also get to choose which kind of UTF-8 collation to use. So which would you use?

utf8_general_ci or utf8_unicode_ci?

(I tend to use the UTF-8 variant that is used for the database connection.)



Answer 14:

Because you sometimes want to operate easily on code points - then you'd choose, e.g., UCS-2 or UCS-4.
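Python's str illustrates the appeal: it exposes strings as sequences of code points (conceptually like UCS-4), so indexing never lands inside a multi-unit sequence (the sample string is arbitrary):

```python
s = "a\U0001F600b"               # 3 code points, one outside the BMP
assert len(s) == 3               # length counted in code points
assert s[1] == "\U0001F600"      # O(1) indexing by code point

# The same string is 4 UTF-16 code units and 6 UTF-8 bytes:
assert len(s.encode("utf-16-le")) // 2 == 4
assert len(s.encode("utf-8")) == 6
```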



Answer 15:

Many APIs require other Unicode encodings - mostly UTF-16. For instance, Java, .NET, Win32.



Answer 16:

At my previous employer we used iso-8859-1 for some of our ASP pages to match the collation of our SQL Server, which, as you can guess, was not Unicode. I wanted to change the collation, but the manager said to wait until we upgraded our SQL Server. Needless to say, it never happened - I left them a little over a year ago, so I don't know if they ever did it.



Answer 17:

Unicode certainly is a good place to start in most cases, but a developer should be familiar with many different character encodings. Certainly, plain ASCII might be used if the set of characters is limited.

What if you're a developer and receiving data from a source that doesn't send UTF-8? There could be lots of interface issues if you don't understand your input.

Joel's article on the must-knows for character encoding is good and worth reading.