I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don't understand all the details. An example will illustrate my issues. Look at this file below:
![alt text](http://www.yart.com.au/stackoverflow/unicode2.png)
I have opened the file in a binary editor to closely examine the last of the three a's next to the first Chinese character:
![alt text](http://www.yart.com.au/stackoverflow/unicode1.png)
According to Joel:
In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
So does the editor reason like this:
- E6 (230) is above code point 128.
- Thus I will treat this byte as the start of a character stored in 2, 3, or even up to 6 bytes.
If so, what indicates that the character takes more than 2 bytes? How is this indicated by the bytes that follow E6?
Is my Chinese character stored in 2, 3, 4, 5 or 6 bytes?
If the encoding is UTF-8, then the following table shows how a Unicode code point (up to 21 bits) is converted into UTF-8 encoding:

| Scalar Value               | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|----------------------------|----------|----------|----------|----------|
| 00000000 0xxxxxxx          | 0xxxxxxx |          |          |          |
| 00000yyy yyxxxxxx          | 110yyyyy | 10xxxxxx |          |          |
| zzzzyyyy yyxxxxxx          | 1110zzzz | 10yyyyyy | 10xxxxxx |          |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |

There are a number of non-allowed values - in particular, bytes 0xC0, 0xC1, and 0xF5 - 0xFF can never appear in well-formed UTF-8. There are also a number of other verboten combinations, captured in the table of well-formed UTF-8 byte sequences:

| Code Points        | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|--------------------|----------|----------|----------|----------|
| U+0000..U+007F     | 00..7F   |          |          |          |
| U+0080..U+07FF     | C2..DF   | 80..BF   |          |          |
| U+0800..U+0FFF     | E0       | A0..BF   | 80..BF   |          |
| U+1000..U+CFFF     | E1..EC   | 80..BF   | 80..BF   |          |
| U+D000..U+D7FF     | ED       | 80..9F   | 80..BF   |          |
| U+E000..U+FFFF     | EE..EF   | 80..BF   | 80..BF   |          |
| U+10000..U+3FFFF   | F0       | 90..BF   | 80..BF   | 80..BF   |
| U+40000..U+FFFFF   | F1..F3   | 80..BF   | 80..BF   | 80..BF   |
| U+100000..U+10FFFF | F4       | 80..8F   | 80..BF   | 80..BF   |

The irregularities are in the 1st byte and 2nd byte columns. Note that the codes U+D800 - U+DFFF are reserved for UTF-16 surrogates and cannot appear in valid UTF-8.
These tables are lifted from the Unicode standard version 5.1.
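To make the mapping concrete, here is a small sketch (my own illustration, not taken from the standard) that converts a code point into UTF-8 bytes following the patterns above; the function name encode_utf8 is just for this example:

```c
#include <stdio.h>

/* Encode a Unicode code point (up to 21 bits) into UTF-8, following the
 * bit patterns in the table above. Returns the number of bytes written
 * (1-4), or 0 for surrogates and values beyond U+10FFFF. */
static int encode_utf8(unsigned int cp, unsigned char out[4])
{
    if (cp <= 0x7f) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7ff) {                /* 110yyyyy 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    } else if (cp <= 0xffff) {               /* 1110zzzz 10yyyyyy 10xxxxxx */
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;                        /* UTF-16 surrogates are not allowed */
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    } else if (cp <= 0x10ffff) {             /* 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx */
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;                                /* beyond U+10FFFF */
}

int main(void)
{
    unsigned char buf[4];
    int n = encode_utf8(0x6fb3, buf);        /* U+6FB3, the Chinese character discussed below */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);
    printf("\n");                            /* prints: E6 BE B3 */
    return 0;
}
```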
In the question, the material from offset 0x0010 .. 0x008F can be decoded in exactly this way, sequence by sequence.
That's all part of the UTF-8 encoding (which is only one encoding scheme for Unicode).

The size can be figured out by examining the first byte as follows:

- If it starts with the bits "10" (0x80-0xbf), it's not the first byte of a sequence and you should back up until you find the start - any byte that starts with "0" or "11" (thanks to Jeffrey Hantin for pointing that out in the comments).
- If it starts with "0" (0x00-0x7f), it's 1 byte.
- If it starts with "110" (0xc0-0xdf), it's 2 bytes.
- If it starts with "1110" (0xe0-0xef), it's 3 bytes.
- If it starts with "11110" (0xf0-0xf7), it's 4 bytes.

I'll duplicate the table showing this, but the original is on the Wikipedia UTF-8 page:

| Range               | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|---------------------|----------|----------|----------|----------|
| U+000000 - U+00007F | 0xxxxxxx |          |          |          |
| U+000080 - U+0007FF | 110yyyxx | 10xxxxxx |          |          |
| U+000800 - U+00FFFF | 1110yyyy | 10yyyyxx | 10xxxxxx |          |
| U+010000 - U+10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx |
The Unicode characters in the above table are constructed from the bits zzzzz yyyyyyyy xxxxxxxx, where the z and y bits are assumed to be zero where they're not given. Some bytes are considered illegal as a start byte: 0xc0 and 0xc1 could only produce an over-long encoding of a code point below 0x80 (which must be stored in 1 byte), and 0xf5 - 0xff would produce code points above the Unicode maximum of U+10FFFF (or start 5- and 6-byte sequences, which are no longer allowed). In addition, subsequent bytes in a multi-byte sequence that don't begin with the bits "10" are also illegal.
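Putting the first-byte ranges and those restrictions together in code, a helper along these lines (just a sketch; the name utf8_sequence_length is mine) determines how many bytes a sequence occupies:

```c
#include <stdio.h>

/* Return the expected length of a UTF-8 sequence based on its first byte:
 * 1-4 for a valid start byte, 0 for a continuation byte (10xxxxxx),
 * or -1 for a byte that can never start a well-formed sequence. */
static int utf8_sequence_length(unsigned char first)
{
    if (first <= 0x7f)                  return 1;   /* 0xxxxxxx                    */
    if (first <= 0xbf)                  return 0;   /* 10xxxxxx: continuation byte */
    if (first >= 0xc2 && first <= 0xdf) return 2;   /* 110xxxxx                    */
    if (first >= 0xe0 && first <= 0xef) return 3;   /* 1110xxxx                    */
    if (first >= 0xf0 && first <= 0xf4) return 4;   /* 11110xxx                    */
    return -1;                                      /* 0xc0, 0xc1, 0xf5-0xff       */
}

int main(void)
{
    printf("0xe6 starts a %d-byte sequence\n", utf8_sequence_length(0xe6)); /* 3 */
    printf("0xf4 starts a %d-byte sequence\n", utf8_sequence_length(0xf4)); /* 4 */
    return 0;
}
```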
As an example, consider the sequence [0xf4, 0x8a, 0xaf, 0x8d]. This is a 4-byte sequence, as the first byte falls between 0xf0 and 0xf7. In binary that's 11110 100, 10 001010, 10 101111 and 10 001101, so the payload bits are zzzzz = 10000, yyyyyyyy = 10101011 and xxxxxxxx = 11001101, giving the code point U+10ABCD.
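The same arithmetic can be written as bit operations; this short check (my own sketch) strips the marker bits and reassembles the payload:

```c
#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* [0xf4, 0x8a, 0xaf, 0x8d]: mask off the 11110 / 10 marker bits and
     * concatenate the remaining payload bits, 6 at a time. */
    unsigned int cp = ((0xf4u & 0x07) << 18)   /* 11110zzz -> zzz     */
                    | ((0x8au & 0x3f) << 12)   /* 10zzyyyy -> zzyyyy  */
                    | ((0xafu & 0x3f) <<  6)   /* 10yyyyxx -> yyyyxx  */
                    |  (0x8du & 0x3f);         /* 10xxxxxx -> xxxxxx  */

    assert(cp == 0x10ABCD);
    printf("U+%06X\n", cp);                    /* prints: U+10ABCD */
    return 0;
}
```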
For your specific query with the first byte 0xe6 (length = 3), the byte sequence is 0xe6 0xbe 0xb3, or in binary 1110 0110, 10 111110 and 10 110011. Extracting the payload bits gives yyyyyyyy = 0110 1111 and xxxxxxxx = 1011 0011, i.e. the code point U+6FB3. If you look that code point up in a Unicode chart, you'll see it's the one you had in your question: 澳.
To show how the decoding works, I went back to my archives to find my UTF-8 handling code. I've had to morph it a bit to make it a complete program, and the encoding has been removed (since the question was really about decoding), so I hope I haven't introduced any errors from the cut and paste.
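A minimal decoder in that spirit (a sketch of the idea rather than that original program; the function name decode_utf8 is just for illustration) might look like this:

```c
#include <stdio.h>
#include <stdlib.h>

/* Decode one UTF-8 sequence from buf and print its code point.
 * Returns the sequence length, or 0 on error. Over-long encodings
 * are not rejected here, to keep the sketch short. */
static int decode_utf8(const unsigned char *buf)
{
    unsigned int cp;
    int len, i;

    if (buf[0] <= 0x7f)      { cp = buf[0];        len = 1; }  /* 0xxxxxxx */
    else if (buf[0] <= 0xbf) { return 0; }                     /* continuation byte, not a start */
    else if (buf[0] <= 0xdf) { cp = buf[0] & 0x1f; len = 2; }  /* 110xxxxx */
    else if (buf[0] <= 0xef) { cp = buf[0] & 0x0f; len = 3; }  /* 1110xxxx */
    else if (buf[0] <= 0xf7) { cp = buf[0] & 0x07; len = 4; }  /* 11110xxx */
    else                     { return 0; }                     /* 0xf8-0xff: illegal */

    for (i = 1; i < len; i++) {
        if ((buf[i] & 0xc0) != 0x80)                           /* must be 10xxxxxx */
            return 0;
        cp = (cp << 6) | (buf[i] & 0x3f);                      /* append 6 payload bits */
    }

    printf("%d byte(s) -> U+%04X\n", len, cp);
    return len;
}

int main(int argc, char *argv[])
{
    unsigned char buf[4] = {0, 0, 0, 0};
    int i;

    if (argc != 5) {
        fprintf(stderr, "usage: %s byte1 byte2 byte3 byte4\n", argv[0]);
        return 1;
    }
    for (i = 0; i < 4; i++)
        buf[i] = (unsigned char)strtoul(argv[i + 1], NULL, 16);

    return decode_utf8(buf) == 0;
}
```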
You can run it with your sequence of bytes (you'll need 4, so use 0 to pad them out) as follows:
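With the sketch above (assuming it is saved as utf8dec.c), that would look something like:

```
$ cc -o utf8dec utf8dec.c
$ ./utf8dec 0xe6 0xbe 0xb3 0x00
3 byte(s) -> U+6FB3
$ ./utf8dec 0xf4 0x8a 0xaf 0x8d
4 byte(s) -> U+10ABCD
```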
An excellent reference for this is Markus Kuhn's UTF-8 and Unicode FAQ.