I have read many articles trying to find out the maximum number of Unicode code points, but I did not find a definitive answer.
I understand that the Unicode code point range was restricted so that the UTF-8, UTF-16 and UTF-32 encodings can all handle the same number of code points. But what is this number of code points?
The most frequent answer I encountered is that Unicode code points are in the range 0x000000 to 0x10FFFF (1,114,112 code points), but I have also read in other places that there are 1,112,114 code points. So is there a single number that can be given, or is the issue more complicated than that?
The maximum valid code point in Unicode is U+10FFFF, which makes it a 21-bit code set (but not all 21-bit integers are valid Unicode code points; specifically the values from 0x110000 to 0x1FFFFF are not valid Unicode code points).
This is where the number 1,114,112 comes from: U+0000 .. U+10FFFF is 1,114,112 values.
However, there is also a set of code points that are the surrogates for UTF-16. These are in the range U+D800 .. U+DFFF. That is 2,048 code points reserved for UTF-16.
1,114,112 - 2,048 = 1,112,064
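The arithmetic above can be checked with a few lines of JavaScript. This is only a sketch; the constant names are my own and are not part of any API.

```javascript
// Count code points from the ranges quoted above.
const MAX_CODE_POINT = 0x10FFFF;
const totalCodePoints = MAX_CODE_POINT + 1;      // U+0000 .. U+10FFFF
const surrogateCount = 0xDFFF - 0xD800 + 1;      // U+D800 .. U+DFFF
const nonSurrogateCount = totalCodePoints - surrogateCount;

console.log(totalCodePoints);   // 1114112
console.log(surrogateCount);    // 2048
console.log(nonSurrogateCount); // 1112064
```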
There are also 66 noncharacters. These are defined in part in Corrigendum #9: 34 values of the form U+nFFFE and U+nFFFF (where n is a value 0x00000, 0x10000, … 0xF0000, 0x100000), and 32 values U+FDD0 - U+FDEF. Subtracting those too yields 1,111,998 allocatable characters.
There are three ranges reserved for 'private use': U+E000 .. U+F8FF, U+F0000 .. U+FFFFD, and U+100000 .. U+10FFFD.
The number of values actually assigned depends on the version of Unicode you're looking at. You can find information about the latest version at the Unicode Consortium. Amongst other things, the Introduction there says:
The Unicode Standard, Version 7.0, contains 112,956 characters
So only about 10% of the available code points have been allocated.
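As a rough check of the noncharacter count and the 10% figure, here is a small sketch. The helper name `isNoncharacter` is hypothetical, and the 112,956 figure is simply the Version 7.0 number quoted above.

```javascript
// Hypothetical helper: true for the 66 noncharacters described above.
function isNoncharacter(cp) {
  // U+FDD0 .. U+FDEF, plus the last two code points of each of the 17 planes
  return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) === 0xFFFE;
}

let noncharacterCount = 0;
for (let cp = 0; cp <= 0x10FFFF; cp++) {
  if (isNoncharacter(cp)) noncharacterCount++;
}

// All code points, minus the 2,048 surrogates, minus the noncharacters.
const allocatable = 0x110000 - 2048 - noncharacterCount;
const assignedInV7 = 112956; // figure quoted from the Unicode 7.0 introduction
const percentAssigned = (assignedInV7 / allocatable * 100).toFixed(1);

console.log(noncharacterCount);     // 66
console.log(allocatable);           // 1111998
console.log(percentAssigned + '%'); // 10.2%
```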
I can't account for why you found 1,112,114 as the number of code points.
Incidentally, the upper limit U+10FFFF is chosen so that all the values in Unicode can be represented in one or two 2-byte coding units in UTF-16, using one high surrogate and one low surrogate to represent values outside the BMP or Basic Multilingual Plane, which is the range U+0000 .. U+FFFF.
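To illustrate why the limit falls exactly at U+10FFFF, here is a minimal sketch of the UTF-16 surrogate arithmetic. The function names `toSurrogatePair` and `fromSurrogatePair` are made up for this example; they are not built-in functions.

```javascript
// Split a code point outside the BMP into a high and a low surrogate.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('code point is not in U+10000 .. U+10FFFF');
  }
  const offset = cp - 0x10000;            // fits in 20 bits
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Combine a high and a low surrogate back into a code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// The largest possible pair decodes to the largest code point.
const largest = fromSurrogatePair(0xDBFF, 0xDFFF);
console.log(largest.toString(16)); // 10ffff
```

Each surrogate carries 10 bits, so a pair covers 2^20 = 1,048,576 values on top of the 65,536 in the BMP, which adds up to 1,114,112.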
Yes, all the code points that can't be represented in UTF-16 (including using surrogates) have been declared invalid.
U+10FFFF is the highest code point, but the surrogates and noncharacters such as U+FFFE and U+FFFF aren't usable code points, so the total count is a bit lower.
I have written a small routine that prints a very long table on screen, covering the values from start to start + range, where start is a number the user can customize. This is the snippet:
function getVal()
{
var start = parseInt(document.getElementById('start').value);
var range = parseInt(document.getElementById('range').value);
var end = start + range;
return [start, range, end];
}
function next()
{
var values = getVal();
document.getElementById('start').value = values[2];
document.getElementById('ok').click();
}
function prev()
{
var values = getVal();
document.getElementById('start').value = values[0] - values[1];
document.getElementById('ok').click();
}
function renderCharCodeTable()
{
var values = getVal();
var start = values[0];
var end = values[2];
const MINSTART = 0; // Allowed range
const MAXEND = 4294967294; // Allowed range
start = start < MINSTART ? MINSTART : start;
end = end < MINSTART ? (MINSTART + 1) : end;
start = start > MAXEND ? (MAXEND - 1) : start;
end = end >= MAXEND ? (MAXEND + 1) : end;
var tr = [];
var unicodeCharSet = document.getElementById('unicodeCharSet');
var cCode;
var cPoint;
for (var c = start; c < end; c++)
{
try
{
cCode = String.fromCharCode(c);
}
catch (e)
{
cCode = 'fromCharCode max val exceeded';
}
try
{
cPoint = String.fromCodePoint(c);
}
catch (e)
{
cPoint = 'fromCodePoint max val exceeded';
}
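// push() keeps the array dense; indexing by c would create a huge sparse array for large start values
tr.push('<tr><td>' + c + '</td><td>' + cCode + '</td><td>' + cPoint + '</td></tr>');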
}
unicodeCharSet.innerHTML = tr.join('');
}
function startRender()
{
setTimeout(renderCharCodeTable, 100);
console.time('renderCharCodeTable');
}
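// Pass the function itself; startRender() would call it immediately and register its undefined return value
window.addEventListener("load", startRender);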
body
{
margin-bottom: 50%;
}
form
{
position: fixed;
}
table *
{
border: 1px solid black;
font-size: 1em;
text-align: center;
}
table
{
margin: auto;
border-collapse: collapse;
}
td:hover
{
padding-bottom: 1.5em;
padding-top: 1.5em;
}
tbody > tr:hover
{
font-size: 5em;
}
<form>
Start Unicode: <input type="number" id="start" value="0" onchange="renderCharCodeTable()" min="0" max="4294967300" title="Set a number from 0 to 4294967294" >
<p></p>
Show <input type="number" id="range" value="30" onchange="renderCharCodeTable()" min="1" max="1000" title="Range to show. Insert a value from 1 to 1000" > symbols at once.
<p></p>
<input type="button" id="pr" value="◄◄" onclick="prev()" title="Show previous" >
<input type="button" id="nx" value="►►" onclick="next()" title="Show next" >
<input type="button" id="ok" value="OK" onclick="startRender()" title="Ok" >
<input type="reset" id="rst" value="X" onclick="startRender()" title="Reset" >
</form>
<table>
<thead>
<tr>
<th>CODE</th>
<th>Symbol fromCharCode</th>
<th>Symbol fromCodePoint</th>
</tr>
</thead>
<tbody id="unicodeCharSet">
<tr><td colspan="3">Rendering...</td></tr>
</tbody>
</table>
Run it a first time, then open the code and set the value of the start variable to a very high number, just a little lower than the MAXEND constant. The following is what I obtained:
code equivalent symbol
{~~~ first execution output example ~~~~~}
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33 !
34 "
35 #
36 $
37 %
38 &
39 '
40 (
41 )
42 *
43 +
44 ,
45 -
46 .
47 /
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
{~~~ second execution output example ~~~~~}
4294967275 →
4294967276 ↓
4294967277 ■
4294967278 ○
4294967279
4294967280
4294967281
4294967282
4294967283
4294967284
4294967285
4294967286
4294967287
4294967288
4294967289
4294967290
4294967291
4294967292 
4294967293 �
4294967294
The output is of course truncated (between the first and the second execution) because it is too long.
After 4294967294 (= 2^32 - 2) the function inexorably stops, so I suppose that it has reached its maximum possible value, and I interpret this as the maximum possible value of the Unicode char code table. Of course, as said in other answers, not every char code has an equivalent symbol; frequently they are empty, as the example shows. Also, a lot of symbols are repeated multiple times at different points between char codes 0 and 4294967294.
Edit: improvements
(thanks @duskwuff)
Now it is also possible to compare the behavior of String.fromCharCode and String.fromCodePoint. Notice that the first one goes all the way up to 4294967294, but its output repeats every 65536 codes (16 bits = 2^16). The second one stops working after code 1114111: because the list of Unicode characters and symbols starts from 0, we have a total of 1,114,112 Unicode code points, but as said in other answers not all of them are valid, in the sense that some are empty points. Also remember that to display a given Unicode character you need a font that defines it. Otherwise you will see an empty character or an empty square.
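The difference between the two functions can be seen in isolation with a short sketch. The helper `tryFromCodePoint` is a name I made up for this example.

```javascript
// String.fromCharCode keeps only the low 16 bits of its argument,
// so its output repeats every 65536 values and it never throws.
const wrapsToA = String.fromCharCode(65 + 65536) === 'A';
const wrapsToArrow = String.fromCharCode(4294967275) === '\uFFEB'; // the "→" in the table above

// String.fromCodePoint throws a RangeError above 0x10FFFF (1114111).
function tryFromCodePoint(cp) {
  try {
    return String.fromCodePoint(cp);
  } catch (e) {
    return e.name;
  }
}

console.log(wrapsToA);                          // true
console.log(wrapsToArrow);                      // true
console.log(tryFromCodePoint(0x10FFFF).length); // 2 (one surrogate pair)
console.log(tryFromCodePoint(0x110000));        // RangeError
```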
Notice:
I have noticed that on some Android systems, using the Chrome browser for Android, the JS String.fromCodePoint function returns an error for all code points.