EDIT
Since it seems I'm not going to get an answer to the general question, I'll restrict it to one detail: is my understanding of the following correct?
Surrogates work as follows:
- If the first pair of bytes is not between D800 and DBFF, there will not be a second pair.
- If it is between D800 and DBFF: a) there will be a second pair, and b) the second pair will be in the range DC00 to DFFF.
- There is no single-pair UTF-16 character with a value between D800 and DBFF.
- There is no single-pair UTF-16 character with a value between DC00 and DFFF.
Is this right?
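To make the rules above concrete, here is a minimal sketch of how I imagine a decoder would apply them. The function name `decode_utf16_units` is made up for illustration, and it assumes the input has already been split into 16-bit values (so byte order is not handled here). The formula for combining the two pairs is the one I've seen in descriptions of UTF-16.

```python
def decode_utf16_units(units):
    """Turn a list of 16-bit values into Unicode code points.

    Hypothetical helper: raises ValueError wherever the rules above are violated.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # First pair is in D800..DBFF, so a second pair must follow...
            if i + 1 >= len(units):
                raise ValueError("high surrogate at end of input")
            low = units[i + 1]
            # ...and that second pair must be in DC00..DFFF.
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("high surrogate not followed by a low surrogate")
            # Each surrogate carries 10 bits; the result is offset by 0x10000.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A value in DC00..DFFF never stands on its own.
            raise ValueError("low surrogate without a preceding high surrogate")
        else:
            # Anything else is a complete character in a single pair of bytes.
            code_points.append(unit)
            i += 1
    return code_points
```

If my understanding is right, `decode_utf16_units([0x0041])` should give `[0x41]`, and `decode_utf16_units([0xD83D, 0xDE00])` should give `[0x1F600]`.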
Original question
I've tried reading about UTF-16 but I can't seem to understand it. What are "planes" and "surrogates" etc.? Is a "plane" the first 5 bits of the first byte? If so, then why not 32 planes since we're using those 5 bits anyway? And what are surrogates? Which bits do they correspond to?
I do understand that UTF-16 is a way to encode Unicode characters, and that it encodes some characters using 16 bits and others using 32 bits, no more, no less. I assume that there is some list of values for the first 2 bytes (are those the most significant ones?) which indicates that a second 2 bytes will be present.
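The 16-or-32-bit behaviour is easy to observe with Python's built-in codec. This is only a small sketch: the helper name `utf16_units` is hypothetical, and it assumes big-endian UTF-16 without a byte order mark (the `"utf-16-be"` codec).

```python
def utf16_units(ch):
    """Return the 16-bit values Python produces when encoding ch as UTF-16 (big-endian)."""
    data = ch.encode("utf-16-be")
    # Group the encoded bytes two at a time into 16-bit values.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for ch in ("A", "\u20ac", "\U0001F600"):
    units = utf16_units(ch)
    print(f"U+{ord(ch):04X}: {len(units) * 16} bits,", " ".join(f"{u:04X}" for u in units))
```

For U+0041 and U+20AC this yields a single 16-bit value each, while U+1F600 comes out as the two values D83D and DE00.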
But rather than me going on about what I don't understand, perhaps someone can bring some order to this?