Encoding.UTF8 or Encoding.Unicode?

Posted 2019-08-27 04:10

Is Encoding.Unicode just a name for UTF-16? Then why is it called just Unicode instead of UTF16?

In the encoding documentation Microsoft states that for most scenarios and applications you should avoid using Encoding.ASCII and Encoding.Default.

When using System.Text.Encoding, should I be using Encoding.Unicode or Encoding.UTF8 in most scenarios?

2 Answers
唯我独甜
#2 · 2019-08-27 04:47

It comes from the early days of Unicode. Unicode 1.0 was a 16-bit encoding, as it was assumed that 65536 code points would be sufficient. Unicode 2.0 abandoned this restriction; however, early adopters of Unicode, including Microsoft, had named their encoding "Unicode", and the name has stuck.

Nowadays you should be using UTF-8 unless you have a specific reason not to, e.g. legacy software you need to integrate with.

The reason for this is that ASCII is binary compatible with UTF-8 (any ASCII text is already valid UTF-8, byte for byte), and there is a lot of ASCII out there.
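
To make the compatibility point concrete, here is a minimal sketch in Python rather than .NET. It assumes that Python's "utf-8" and "utf-16-le" codecs produce the same byte layouts as Encoding.UTF8 and Encoding.Unicode; the sample string is made up for illustration.

```python
# Assumption: Python's "utf-8" and "utf-16-le" codecs stand in for
# .NET's Encoding.UTF8 and Encoding.Unicode in this sketch.
text = "Hello, world"  # hypothetical ASCII-only sample

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# ASCII-only text is byte-for-byte identical under ASCII and UTF-8.
print(ascii_bytes == utf8_bytes)   # True

# UTF-16 uses two bytes per character here, so the bytes differ.
print(ascii_bytes == utf16_bytes)  # False
print(len(utf8_bytes), len(utf16_bytes))  # 12 24
```

An ASCII-only consumer can therefore read this UTF-8 output unchanged, while the UTF-16 output would need to be converted first.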

混吃等死
#3 · 2019-08-27 04:47

Is Encoding.Unicode just a name for UTF-16?

Yes. Specifically, it is little-endian UTF-16. Encoding has a separate BigEndianUnicode property for big-endian UTF-16.
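
The difference between the two byte orders can be sketched as follows. This is Python, not .NET, and it assumes the "utf-16-le" and "utf-16-be" codecs correspond to the byte layouts of Encoding.Unicode and Encoding.BigEndianUnicode respectively.

```python
# Assumption: "utf-16-le" ~ Encoding.Unicode, "utf-16-be" ~ Encoding.BigEndianUnicode.
text = "A"  # U+0041

le = text.encode("utf-16-le")  # low byte first
be = text.encode("utf-16-be")  # high byte first

print(le.hex(), be.hex())  # 4100 0041

# Decoding with the wrong byte order does not fail; it silently yields
# a different character (U+4100 instead of U+0041).
wrong = le.decode("utf-16-be")
print(hex(ord(wrong)))  # 0x4100
```

This is why both sides of an exchange have to agree on the byte order, or signal it with a byte order mark, when UTF-16 is used.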

Then why is it called just Unicode instead of UTF16?

For historical reasons. Microsoft was one of the first companies to adopt Unicode, so it had a "Unicode" implementation in Windows back in the early days of Unicode, before UTF-16 was invented. "Unicode" is Microsoft's de facto name for whatever its native Unicode encoding is, which used to be UCS-2 and is now UTF-16.

When using System.Text.Encoding, should I be using Encoding.Unicode or Encoding.UTF8 in most scenarios?

That really depends on your particular scenario. Use whichever encoding suits your needs. Both encodings have strengths and weaknesses.

UTF-8 is commonly used for interoperability in communications protocols, as it doesn't suffer from endianness problems and is largely compatible with existing text-based protocols. It is also usually smaller than UTF-16 in byte storage for most languages.

UTF-16 is usually easier to process in memory than UTF-8, which is why so many libraries and frameworks use it for strings. It can also be smaller than UTF-8 in byte storage, especially for East Asian languages.
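
The size trade-off is easy to measure. The sketch below is Python rather than .NET and assumes the "utf-8" and "utf-16-le" codecs match Encoding.UTF8 and Encoding.Unicode; the helper name encoded_sizes and the sample strings are hypothetical, chosen only for illustration.

```python
# Assumption: Python codecs stand in for Encoding.UTF8 / Encoding.Unicode.
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "English": "Hello, world",    # ASCII: 1 byte in UTF-8, 2 in UTF-16
    "Japanese": "こんにちは世界",   # 3 bytes each in UTF-8, 2 in UTF-16
    "Emoji": "\U0001F600",        # outside the BMP: 4 bytes in both
}

sizes = {name: encoded_sizes(text) for name, text in samples.items()}
for name, (utf8_len, utf16_len) in sizes.items():
    print(f"{name}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
# English: UTF-8 = 12 bytes, UTF-16 = 24 bytes
# Japanese: UTF-8 = 21 bytes, UTF-16 = 14 bytes
# Emoji: UTF-8 = 4 bytes, UTF-16 = 4 bytes
```

Which encoding is smaller therefore depends on the text: mostly-ASCII content favours UTF-8, while text made up mainly of characters that need three bytes in UTF-8 favours UTF-16.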
