Recently I faced an issue with character encoding, and while digging into character sets and encodings this doubt came to my mind. UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding, how does it differentiate single-byte and double-byte characters? For example, "Aݔ" is stored as "410754" (the Unicode code point for A is 41 and for the Arabic character it is 0754). How does the encoding identify 41 as one character and 0754 as another, two-byte character? Why isn't 4107 taken as one double-byte character and 54 as a single-byte character?
Short answer:
UTF-8 is designed to be able to unambiguously identify the type of each byte in a text stream:
- 1-byte codes (all and only the ASCII characters) start with a 0 bit;
- leading bytes of 2-byte codes start with 110;
- leading bytes of 3-byte codes start with 1110;
- leading bytes of 4-byte codes start with 11110;
- continuation bytes (of all multi-byte codes) start with 10.
Your example Aݔ, which consists of the Unicode code points U+0041 and U+0754, is encoded in UTF-8 as the three bytes 41 DD 94. So, when decoding, UTF-8 knows that the first byte must be a 1-byte code, the second byte must be the leading byte of a 2-byte code, the third byte must be a continuation byte, and since the second byte is the leading byte of a 2-byte code, the second and third bytes together must form this 2-byte code.
See here how UTF-8 encodes Unicode code points.
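To see the same thing in code, here is a minimal Python sketch (standard library only; the byte classification below is hand-rolled purely for illustration) that encodes the example string and labels each byte by its type:

text = "A\u0754"                  # "Aݔ": U+0041 followed by U+0754
data = text.encode("utf-8")
print(data.hex())                 # 41dd94

for byte in data:
    if byte <= 0x7F:
        kind = "1-byte (ASCII) code"
    elif byte <= 0xBF:
        kind = "continuation byte"
    elif byte <= 0xDF:
        kind = "leading byte of a 2-byte code"
    elif byte <= 0xEF:
        kind = "leading byte of a 3-byte code"
    else:
        kind = "leading byte of a 4-byte code"
    print(f"0x{byte:02X}: {kind}")

# Output:
# 0x41: 1-byte (ASCII) code
# 0xDD: leading byte of a 2-byte code
# 0x94: continuation byte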
That's not how UTF-8 works.
Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose codepoints numerically match their UTF-8 representation. For example, U+0041 becomes 0x41, which is 01000001 in binary.

All other characters are represented with multiple bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
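To make those ranges concrete, here is a small Python sketch; the euro sign and the emoji are my own example characters, not from the question:

for label, ch in [
    ("U+0041 (ASCII letter A)",          "\u0041"),
    ("U+0754 (Arabic letter)",           "\u0754"),
    ("U+20AC (euro sign)",               "\u20ac"),
    ("U+1F600 (emoji outside the BMP)",  "\U0001F600"),
]:
    encoded = ch.encode("utf-8")
    print(f"{label}: {encoded.hex()} -> {len(encoded)} byte(s)")

# Output:
# U+0041 (ASCII letter A): 41 -> 1 byte(s)
# U+0754 (Arabic letter): dd94 -> 2 byte(s)
# U+20AC (euro sign): e282ac -> 3 byte(s)
# U+1F600 (emoji outside the BMP): f09f9880 -> 4 byte(s)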
Computers know where one character ends and the next one starts because UTF-8 was designed so that the single-byte values used for ASCII do not overlap with those used in multi-byte sequences. The bytes 0x00 through 0x7F are only used for ASCII and nothing else; the bytes above 0x7F are only used for multi-byte sequences and nothing else. Furthermore, the bytes that are used at the beginning of multi-byte sequences cannot occur in any other position in those sequences.

Because of that, the codepoints of non-ASCII characters cannot be stored as-is; they need to be encoded. Consider the following binary patterns:
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The number of leading ones in the first byte tells you how many bytes the whole sequence consists of. All following bytes of the sequence start with 10 in binary. To encode a character you convert its codepoint to binary and fill in the x's.

As an example: U+0754 is between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100, so you replace the x's with those digits and get 11011101 10010100, i.e. the bytes 0xDD 0x94.
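The same bit-filling can be reproduced in a few lines of Python; a minimal sketch of just this two-byte case, checked against the built-in encoder:

codepoint = 0x0754                            # 11 significant bits: 111 0101 0100
byte1 = 0b11000000 | (codepoint >> 6)         # 110 + top 5 bits  -> 0xDD
byte2 = 0b10000000 | (codepoint & 0b111111)   # 10  + low 6 bits  -> 0x94
print(f"{byte1:08b} {byte2:08b}")             # 11011101 10010100
print(bytes([byte1, byte2]) == "\u0754".encode("utf-8"))   # True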