I recently went through an article on character encoding, and I have a question about a certain point mentioned there.
In the first figure, the author shows the characters, their code points in various character sets and how they are encoded in various encoding formats.
For example, the code point of é is E9. In ISO-8859-1 encoding it is represented as E9. In UTF-16 it is represented as 00 E9. But in UTF-8 it is represented using 2 bytes, C3 A9.
My question is: why is this required? The character could be represented with one byte, so why are two bytes used?
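For reference, here is a quick illustrative check of those byte sequences (assuming Python 3; this is just my own sketch, not something from the article):

    s = "é"  # code point U+00E9

    print(s.encode("iso-8859-1").hex())  # e9
    print(s.encode("utf-16-be").hex())   # 00e9 (big-endian, no byte-order mark)
    print(s.encode("utf-8").hex())       # c3a9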
A single byte can hold one of only 256 different values.
This means that an encoding that represents each character as a single byte, such as ISO-8859-1, cannot encode more than 256 different characters. This is why you can't use ISO-8859-1 to correctly write Arabic, or Japanese, or many other languages. There is only a limited amount of space available, and it is already used up by other characters.
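For example, here is a small illustrative Python check (my own sketch, not part of the original explanation): encoding a Japanese character in ISO-8859-1 fails outright, because that character has no slot among the 256 available values.

    # "日" (U+65E5) has no representation in ISO-8859-1's 256-value table.
    try:
        "日".encode("iso-8859-1")
    except UnicodeEncodeError as err:
        print(err)  # 'latin-1' codec can't encode character '\u65e5' ...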
UTF-8, on the other hand, needs to be capable of representing every one of Unicode's more than a million possible code points. This makes it impossible to squeeze every single character into a single byte.
The designers of UTF-8 chose to make all of the ASCII characters (U+0000 to U+007F) representable with a single byte, and required all other characters to be stored as two or more bytes. If they had chosen to give more characters a single-byte representation, the encodings of other characters would have been longer and more complicated.
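To make that concrete, here is a small Python sketch (my own addition, just illustrative) showing how the UTF-8 byte count grows with the code point:

    # ASCII stays at one byte in UTF-8; higher code points need two to four bytes.
    for ch in "Aé€😀":
        print(ch, hex(ord(ch)), len(ch.encode("utf-8")), "byte(s)")
    # A 0x41 1 byte(s)
    # é 0xe9 2 byte(s)
    # € 0x20ac 3 byte(s)
    # 😀 0x1f600 4 byte(s)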
If you want a visual explanation of why bytes above 7F don't represent the corresponding 8859-1 characters, look at the UTF-8 coding unit table on Wikipedia. You will see that every byte value outside the ASCII range either already has a meaning, or is illegal for historical reasons. There just isn't room in the table for bytes to represent their 8859-1 equivalents, and giving the bytes additional meanings would break several important properties of UTF-8.

Because for many languages a single-byte encoding is simply not enough to encode all the letters of all alphabets. Look: one byte (two hex digits, 00 .. FF) gives 16^2 = 256 characters; two bytes (four hex digits, 0000 .. FFFF) give 16^4 = 65536.
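As a rough sanity check of both points (a Python sketch I'm adding, not part of either answer): a lone E9 byte is rejected by a UTF-8 decoder, and a single byte cannot come close to covering Unicode's full range.

    import sys

    # A bare 0xE9 is not a complete UTF-8 sequence, so decoding it fails.
    try:
        bytes([0xE9]).decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)

    # Capacity: one byte vs. two bytes vs. the highest Unicode code point.
    print(16 ** 2, 16 ** 4, hex(sys.maxunicode))  # 256 65536 0x10ffff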
UTF-8 reserves the high bits of each byte (bit 6 and bit 7) to indicate whether the byte starts a sequence or continues one; in a continuation byte only the low 6 bits carry actual character data. That means that any character over 7F requires (at least) 2 bytes.
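For example, here is a hand-decoding of the two-byte sequence C3 A9 (an illustrative sketch I'm adding, assuming Python):

    # 0xC3 = 0b110_00011: '110' marks the start of a 2-byte sequence, 5 payload bits follow.
    # 0xA9 = 0b10_101001: '10' marks a continuation byte, 6 payload bits follow.
    lead, cont = 0xC3, 0xA9
    code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)
    print(hex(code_point))  # 0xe9, i.e. é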