Why is the maximum Unicode code point restricted to 0x10FFFF? Is it possible to represent a code point above this limit (for example, 0x10FFFF + 0x000001 = 0x110000) through an encoding scheme such as UTF-16 or UTF-8?
It's because of UTF-16. Characters outside the BMP are represented in UTF-16 by a surrogate pair, in which the first code unit lies in the range 0xD800–0xDBFF and the second in the range 0xDC00–0xDFFF. Each code unit carries 10 bits of the code point, giving 20 bits of data in total (0x100000 characters), which are split into 16 planes (16 × 2^16 characters). The BMP itself accounts for the remaining 0x10000 characters (code points 0–0xFFFF).
Therefore the total number of characters is 0x100000 + 0x10000 = 0x110000, which allows for code points from 0 to 0x110000 - 1 = 0x10FFFF. Alternatively, the last representable code point can be calculated like this: code points in the BMP are in the range 0–0xFFFF, so the offset for characters encoded with a surrogate pair is 0xFFFF + 1 = 0x10000, which means the last code point that a surrogate pair can represent is 0xFFFFF + 0x10000 = 0x10FFFF.
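The arithmetic above can be checked with a small Python sketch. The helper names `to_surrogate_pair` and `from_surrogate_pair` are made up for illustration; they are not part of any library, and the code only mirrors the calculation described in this answer.

```python
# Hypothetical helpers illustrating the UTF-16 surrogate pair arithmetic.

def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not representable as a surrogate pair")
    offset = cp - 0x10000               # 20-bit value in 0..0xFFFFF
    high = 0xD800 + (offset >> 10)      # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The largest possible pair gives the largest code point.
max_cp = from_surrogate_pair(0xDBFF, 0xDFFF)

# BMP code points plus the 20-bit supplementary range.
total_code_points = 0x10000 + (1 << 20)

print(hex(max_cp), hex(total_code_points))
```

Passing 0x110000 to `to_surrogate_pair` raises `ValueError` in this sketch, because the offset 0x100000 no longer fits in 20 bits.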
The Unicode Character Encoding Stability Policies guarantee that a code point above that limit will never be assigned.
Historically, UTF-8 allowed code points up to U+7FFFFFFF using 6 bytes, whereas UTF-32 could store twice as many. However, due to the limit in UTF-16, the Unicode committee decided that UTF-8 can never be longer than 4 bytes, resulting in the same range as UTF-16.
The same restriction has been applied to UTF-32.
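As a rough illustration, the Python sketch below compares the byte lengths of the historical 6-byte UTF-8 scheme with what a current implementation accepts. `legacy_utf8_length` is a hypothetical helper written for this answer, based on the range boundaries of the original UTF-8 definition. The rest uses Python's built-in `chr` and codecs, and other languages may behave differently.

```python
def legacy_utf8_length(cp):
    """Bytes needed for cp under the original (up to 6-byte) UTF-8 scheme."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    # Upper bound of each length class, for 1 to 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, limit in enumerate(limits, start=1):
        if cp <= limit:
            return length
    raise ValueError("code point exceeds 31 bits")

# The old scheme had room for 0x110000 (4 bytes) and far beyond (6 bytes).
print(legacy_utf8_length(0x110000), legacy_utf8_length(0x7FFFFFFF))

# Modern UTF-8 stops at U+10FFFF, which takes exactly 4 bytes.
last_encoded = chr(0x10FFFF).encode("utf-8")
print(last_encoded.hex(), len(last_encoded))

# Python refuses to create a code point above the limit at all.
try:
    chr(0x110000)
    above_limit_allowed = True
except ValueError:
    above_limit_allowed = False
print(above_limit_allowed)
```

So the old bit layout could technically express 0x110000, but a conforming decoder today rejects such a sequence.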
You can read this more detailed answer and