I have read many articles in order to know what is the maximum number of the Unicode code points, but I did not find a final answer.
I understood that the Unicode code points were minimized to make all of the UTF-8 UTF-16 and UTF-32 encodings able to handle the same number of code points. But what is this number of code points?
The most frequent answer I encountered is that Unicode code points are in the range of 0x000000 to 0x10FFFF (1,114,112 code points) but I have also read in other places that it is 1,112,114 code points. So is there a one number to be given or is the issue more complicated than that?
I have made a very little routine that prints onscreen a very long table, from 0 to n values where the var start is a number that can be customizable by the user. This is the snippet:
Run it a first time, then open the code and set the
start
variable's value to a very high number just a little bit lower than MAXEND constant value. The following is what I obtained:The output of course is truncated (between the first and the second execution) cause it is too long.
After the 4294967294 (= 2^32) the function inexorably stops so I suppose that it has reached its max possible value: so I interpret this as the max possible value of the unicode char code table. Of course as said by other answers, not all char code have an equivalent symbols but frequently they are empty, as the example showed. Also there are a lot of symbols that are repeated multiple time in different points between 0 to 4294967294 char codes
Edit: improvements
(thanks @duskwuff)
Now it is also possible to compare both String.fromCharCode and String.fromCodePoint behaviors. Notice that the first statement arrives to 4294967294 but the output is repeated every 65536 (16 bit = 2^16). The last one stops working at code 1114111 (cause the list of unicode char and symbols starts from 0 we have a total of 1,114,112 Unicode code points but as said in other answers not all of them are valid in the sense that they are empty points). Also remember that to use a certain unicode char you need to have an appropriate font that has the corresponding char defined in it. If not you will show an empty unicode char or an empty square char.
Notice:
I have noticed that in some Android systems using Chrome Browser for Android the js
String.fromCodePoint
returns an error for all codepoints.yes, all the code points that can't be represented in UTF-16 (including using surrogates) have been declared invalid.
U+10FFD seems to be the higest code point, but the surrogates, u+00FFFE and u+00FFFF aren't usable code points so the total count is a bit lower.
The maximum valid code point in Unicode is U+10FFFF, which makes it a 21-bit code set (but not all 21-bit integers are valid Unicode code points; specifically the values from 0x110000 to 0x1FFFFF are not valid Unicode code points).
This is where the number 1,114,112 comes from: U+0000 .. U+10FFFF is 1,114,112 values.
However, there are also a set of code points that are the surrogates for UTF-16. These are in the range U+D800 .. U+DFFF. This is 2048 code points that are reserved for UTF-16.
1,114,112 - 2,048 = 1,112,064
There are also 66 non-characters. These are defined in part in Corrigendum #9: 34 values of the form U+nFFFE and U+nFFFF (where n is a value 0x00000, 0x10000, … 0xF0000, 0x100000), and 32 values U+FDD0 - U+FDEF. Subtracting those too yields 1,111,998 allocatable characters. There are three ranges reserved for 'private use': U+E000 .. U+F8FF, U+F0000 .. U+FFFFD, and U+100000 .. U+10FFFD. And the number of values actually assigned depends on the version of Unicode you're looking at. You can find information about the latest version at the Unicode Consortium. Amongst other things, the Introduction there says:
So only about 10% of the available code points have been allocated.
I can't account for why you found 1,112,114 as the number of code points.
Incidentally, the upper limit U+10FFFF is chosen so that all the values in Unicode can be represented in one or two 2-byte coding units in UTF-16, using one high surrogate and one low surrogate to represent values outside the BMP or Basic Multilingual Plane, which is the range U+0000 .. U+FFFF.