I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don't understand all the details. An example will illustrate my issues. Look at this file below:
(screenshot of the file: http://www.yart.com.au/stackoverflow/unicode2.png)
I have opened the file in a binary editor to closely examine the last of the three a's next to the first Chinese character:
(screenshot of the binary editor: http://www.yart.com.au/stackoverflow/unicode1.png)
According to Joel:
In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
So does the editor reason like this:
- E6 (230) is at or above 128.
- Therefore I will interpret the following bytes as part of a multi-byte sequence of 2, 3, or even up to 6 bytes.
If so, what indicates that the interpretation is more than 2 bytes? How is this indicated by the bytes that follow E6?
Is my Chinese character stored in 2, 3, 4, 5 or 6 bytes?
UTF-8 is constructed in such a way that there is no possible ambiguity about where a character starts and how many bytes it has.
It's really simple.
UTF-8 has a lot of redundancy.
If you want to know how many bytes long a character is, there are multiple ways to tell.
Some bytes are never used, like 0xC0, 0xC1, and 0xF5 to 0xFF, so if you encounter these bytes anywhere, then you are not looking at UTF-8.
Essentially, if a byte begins with a 0, it's a 7-bit code point. If it begins with 10, it's a continuation byte of a multi-byte code point. Otherwise, the number of leading 1's tells you how many bytes this code point is encoded as.
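A minimal sketch of that leading-byte test in Python (the function name is my own, and 0xE6 from the question is just one example input):

```python
def utf8_sequence_length(lead_byte):
    """How many bytes a UTF-8 sequence occupies, judging only by its first byte."""
    if lead_byte & 0b10000000 == 0:            # 0xxxxxxx: a 7-bit code point, 1 byte
        return 1
    if lead_byte & 0b11000000 == 0b10000000:   # 10xxxxxx: a continuation, not a start byte
        raise ValueError("continuation byte, not the start of a character")
    if lead_byte & 0b11100000 == 0b11000000:   # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead_byte & 0b11110000 == 0b11100000:   # 1110xxxx: start of a 3-byte sequence
        return 3
    if lead_byte & 0b11111000 == 0b11110000:   # 11110xxx: start of a 4-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

print(utf8_sequence_length(0xE6))  # 0xE6 = 1110 0110 -> 3, i.e. a 3-byte sequence
```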
The first byte indicates how many bytes encode the code point.
- 0xxxxxxx: 7 bits of code point, encoded in 1 byte
- 110xxxxx 10xxxxxx: 11 bits of code point, encoded in 2 bytes
- 1110xxxx 10xxxxxx 10xxxxxx: 16 bits of code point, encoded in 3 bytes
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 bits of code point, encoded in 4 bytes
http://en.wikipedia.org/wiki/UTF-8#Description
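To illustrate that bit layout, here is a small hand-rolled encoder sketch in Python (the function name and the CJK example character are my own choices, not taken from the question), checked against Python's built-in encoder:

```python
def encode_utf8(cp):
    """Pack a code point into UTF-8 bytes following the bit layout above."""
    if cp <= 0x7F:                     # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                    # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert encode_utf8(ord("a")) == "a".encode("utf-8")        # 1 byte
assert encode_utf8(0x4E2D) == "\u4e2d".encode("utf-8")     # a CJK character, 3 bytes
```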
Why are there so many complicated answers?
3 bytes for 1 Chinese character. You can check this with a byte-counting function (the original used jQuery):
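A minimal equivalent in Python, assuming all you want is the byte count (this uses Python's built-in encoder, not the jQuery helper referred to above):

```python
def utf8_byte_count(text):
    # Length of the string once encoded as UTF-8.
    return len(text.encode("utf-8"))

print(utf8_byte_count("a"))       # 1
print(utf8_byte_count("\u6c49"))  # 3 (a single CJK character takes 3 bytes)
```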
The hint is in this sentence here:
Every code point up to 127 has the top bit set to zero. Therefore, the editor knows that if it encounters a byte where the top bit is a 1, it is the start of a multi-byte character.
Code points from 128 up to 0x7FF are stored as 2 bytes; up to 0xFFFF as 3 bytes; everything else as 4 bytes. (Technically, up to 0x1FFFFF, but the highest code point allowed in Unicode is 0x10FFFF.)
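Written out as a sketch in Python (the function name is mine; the ranges are the ones just quoted):

```python
def bytes_needed(code_point):
    """UTF-8 length implied by the code point's value alone."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:   # highest code point Unicode allows
        return 4
    raise ValueError("beyond the Unicode range")

print(bytes_needed(0x41))      # 1 (ASCII 'A')
print(bytes_needed(0x6C49))    # 3 (a CJK character)
print(bytes_needed(0x1F600))   # 4 (an emoji)
```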
When decoding, the first byte of the multi-byte sequence is used to determine the number of bytes used to make the sequence:
- 110x xxxx => 2-byte sequence
- 1110 xxxx => 3-byte sequence
- 1111 0xxx => 4-byte sequence

All subsequent bytes in the sequence must fit the 10xx xxxx pattern.
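Putting the lead-byte and continuation-byte rules together, here is a decoding sketch in Python. The bytes 0xE6 0xB1 0x89 are used only as an example of a sequence that starts with 0xE6 (they happen to encode U+6C49, not necessarily the character in the screenshot):

```python
def decode_one(data, i=0):
    """Decode one UTF-8 sequence starting at index i; return (code_point, length)."""
    b0 = data[i]
    if b0 & 0x80 == 0x00:                # 0xxxxxxx: single-byte character
        return b0, 1
    if b0 & 0xE0 == 0xC0:                # 110xxxxx
        length, cp = 2, b0 & 0x1F
    elif b0 & 0xF0 == 0xE0:              # 1110xxxx
        length, cp = 3, b0 & 0x0F
    elif b0 & 0xF8 == 0xF0:              # 11110xxx
        length, cp = 4, b0 & 0x07
    else:
        raise ValueError("invalid lead byte")
    for b in data[i + 1:i + length]:
        if b & 0xC0 != 0x80:             # every continuation byte must be 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)      # append its 6 payload bits
    return cp, length

cp, n = decode_one(bytes([0xE6, 0xB1, 0x89]))
print(hex(cp), n)  # 0x6c49 3 -> one code point, stored in 3 bytes
```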