I recently read an article on character encoding, and I have a question about one point it makes.
In the first figure, the author shows several characters, their code points in various character sets, and how they are encoded in various encoding formats.
For example, the code point of é is E9. In ISO-8859-1 it is encoded as the single byte E9, and in UTF-16 it is represented as 00 E9. But in UTF-8 it is represented using two bytes, C3 A9.
My question is: why is this required? The character fits in one byte, so why does UTF-8 use two?
UTF-8 uses the two high bits of each byte (bit 6 and bit 7) to signal whether a byte starts a sequence or continues one, so only the low 6 bits of a continuation byte carry actual character data. That means that any character above 7F requires (at least) two bytes.
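To make the bit layout concrete, here is a minimal Python sketch (not part of the original answer) that builds the two UTF-8 bytes of é, U+00E9, by hand and checks the result against the built-in encoder:

```python
# Minimal sketch: build the two-byte UTF-8 form of é (U+00E9) by hand.
cp = 0x00E9                             # code point of é

# Two-byte UTF-8 pattern: 110xxxxx 10xxxxxx
lead = 0b11000000 | (cp >> 6)           # top bits of the code point
cont = 0b10000000 | (cp & 0b00111111)   # low 6 bits of the code point

print(hex(lead), hex(cont))             # 0xc3 0xa9
print("é".encode("utf-8").hex(" "))     # c3 a9 -- same bytes from the built-in encoder
```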
A single byte can hold one of only 256 different values.
This means that an encoding that represents each character as a single byte, such as ISO-8859-1, cannot encode more than 256 different characters. This is why you can't use ISO-8859-1 to correctly write Arabic, or Japanese, or many other languages. There is only a limited amount of space available, and it is already used up by other characters.
UTF-8, on the other hand, needs to be capable of representing all of the millions of characters in Unicode. This makes it impossible to squeeze every single character into a single byte.
The designers of UTF-8 chose to make all of the ASCII characters (U+0000 to U+007F) representable with a single byte, and required all other characters to be stored as two or more bytes. If they had chosen to give more characters a single-byte representation, the encodings of other characters would have been longer and more complicated.
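To see the consequence of that design choice, here is a small Python sketch (using only the standard encode method) that prints how many bytes a few characters need in UTF-8:

```python
# How many UTF-8 bytes different characters take:
for ch in ("A", "é", "€", "漢"):
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(" "), len(encoded), "byte(s)")
# A  -> 41        (1 byte, ASCII)
# é  -> c3 a9     (2 bytes)
# €  -> e2 82 ac  (3 bytes)
# 漢 -> e6 bc a2  (3 bytes)
```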
If you want a visual explanation of why bytes above 7F don't represent the corresponding 8859-1 characters, look at the UTF-8 code unit table on Wikipedia. You will see that every byte value outside the ASCII range either already has a meaning, or is illegal for historical reasons. There just isn't room in the table for bytes to represent their 8859-1 equivalents, and giving the bytes additional meanings would break several important properties of UTF-8.
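As a small illustration of that point (a Python sketch, not part of the original answer): the lone byte E9 is a perfectly good ISO-8859-1 é, but on its own it is not valid UTF-8:

```python
# A lone E9 byte decodes in ISO-8859-1 but is rejected by UTF-8.
raw = b"\xe9"
print(raw.decode("iso-8859-1"))   # 'é'
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)                    # 0xe9 would need continuation bytes after it
```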
Because for many languages a single-byte encoding is simply not enough to encode all the letters of all alphabets.
Look:
1 byte (00 .. FF) gives 2^8 = 256 characters.
2 bytes (0000 .. FFFF) gives 2^16 = 65,536 characters.
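A quick check of that arithmetic in Python, plus a reminder that ISO-8859-1 maps each of its 256 byte values to exactly one character:

```python
print(2 ** 8, 2 ** 16)                          # 256 65536

# Every single-byte value has one meaning in ISO-8859-1 -- no room left over.
chars = bytes(range(256)).decode("iso-8859-1")
print(len(set(chars)))                          # 256
```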