How does a file with Chinese characters know how m

Posted 2019-01-21 14:26

I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don't understand all the details. An example will illustrate my issues. Look at this file below:

alt text http://www.yart.com.au/stackoverflow/unicode2.png

I have opened the file in a binary editor to closely examine the last of the three a's next to the first Chinese character:

alt text http://www.yart.com.au/stackoverflow/unicode1.png

According to Joel:

In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

So does the editor say:

  1. E6 (230) is above code point 128.
  2. Thus I will interpret the following bytes as either 2, 3, in fact, up to 6 bytes.

If so, what indicates that the interpretation is more than 2 bytes? How is this indicated by the bytes that follow E6?

Is my Chinese character stored in 2, 3, 4, 5 or 6 bytes?

9 answers
趁早两清
#2 · 2019-01-21 14:45

UTF-8 is constructed in such a way that there is no possible ambiguity about where a character starts and how many bytes it has.

It's really simple.

  • A byte in the range 0x80 to 0xBF is never the first byte of a character.
  • Any other byte is always the first byte of a character.
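The two rules above can be sketched in a few lines. This is only an illustration: the helper names `isContinuationByte` and `countCharacters` are made up here, the input is assumed to be well-formed UTF-8, and the sample bytes are my own example (three a's followed by 文, U+6587), not necessarily the bytes in the file from the question.

```javascript
// Hypothetical helpers; `bytes` is any iterable of integers in 0..255.

// A byte in 0x80..0xBF is never the first byte of a character.
function isContinuationByte(b) {
  return b >= 0x80 && b <= 0xBF;
}

// Count characters (code points) by counting the bytes that start one.
// Assumes the input is well-formed UTF-8.
function countCharacters(bytes) {
  let count = 0;
  for (const b of bytes) {
    if (!isContinuationByte(b)) count++;
  }
  return count;
}

// "aaa" followed by 文 (U+6587), which UTF-8 encodes as E6 96 87.
const sample = [0x61, 0x61, 0x61, 0xE6, 0x96, 0x87];
console.log(countCharacters(sample)); // 4
```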

UTF-8 has a lot of redundancy.

If you want to tell how many bytes long a character is, there are multiple ways to tell.

  • The first byte always tells you how many bytes long the character is:
    • If the first byte is 0x00 to 0x7F, it's one byte.
    • 0xC2 to 0xDF means it's two bytes.
    • 0xE0 to 0xEF means it's three bytes.
    • 0xF0 to 0xF4 means it's four bytes.
  • Or, you can just count the number of consecutive bytes in the range 0x80 to 0xBF, because these bytes all belong to the same character as the previous byte.
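As a rough sketch, the first-byte ranges listed above translate directly into a lookup function. The name `sequenceLength` and the convention of returning 0 for "cannot start a character" are my own choices for illustration.

```javascript
// Hypothetical helper: how many bytes long is the character that starts
// with `firstByte` (an integer in 0..255)?
function sequenceLength(firstByte) {
  if (firstByte <= 0x7F) return 1;                       // ASCII
  if (firstByte >= 0xC2 && firstByte <= 0xDF) return 2;
  if (firstByte >= 0xE0 && firstByte <= 0xEF) return 3;
  if (firstByte >= 0xF0 && firstByte <= 0xF4) return 4;
  // 0x80..0xBF is a continuation byte; 0xC0, 0xC1 and 0xF5..0xFF
  // never appear in valid UTF-8.
  return 0;
}

console.log(sequenceLength(0xE6)); // 3
```

Applied to the question, the byte E6 falls in the 0xE0 to 0xEF range, so it starts a three-byte character.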

Some bytes are never used, like 0xC0 to 0xC1 or 0xF5 to 0xFF, so if you encounter these bytes anywhere, then you are not looking at valid UTF-8.

smile是对你的礼貌
#3 · 2019-01-21 14:52

Essentially, if a byte begins with 0, it's a 7-bit code point. If it begins with 10, it's a continuation of a multi-byte code point. Otherwise, the number of leading 1s tells you how many bytes this code point is encoded as.

The first byte indicates how many bytes encode the code point.

0xxxxxxx 7 bits of code point encoded in 1 byte

110xxxxx 10xxxxxx 11 bits of code point encoded in 2 bytes

1110xxxx 10xxxxxx 10xxxxxx 16 bits of code point encoded in 3 bytes

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 21 bits of code point encoded in 4 bytes
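To make the bit patterns concrete, here is a minimal decoding sketch. The function name `decodeFirstCodePoint` is invented for this example, and it assumes the array begins at the first byte of a well-formed sequence (it does no validation). The sample bytes E6 96 87 are my own choice: they encode 文 (U+6587), a character that happens to start with the E6 byte mentioned in the question.

```javascript
// Hypothetical helper: decode the code point at the start of `bytes`.
// Assumes `bytes` starts with the first byte of a well-formed UTF-8 sequence.
function decodeFirstCodePoint(bytes) {
  const first = bytes[0];

  // Count the leading 1 bits of the first byte.
  let ones = 0;
  while (ones < 8 && (first & (0x80 >> ones)) !== 0) ones++;

  // 0xxxxxxx: the byte is the code point.
  if (ones === 0) return { codePoint: first, length: 1 };

  // Drop the leading 1s and the 0 that follows them; keep the payload bits.
  let codePoint = first & (0xFF >> (ones + 1));

  // Each continuation byte (10xxxxxx) contributes its low 6 bits.
  for (let i = 1; i < ones; i++) {
    codePoint = (codePoint << 6) | (bytes[i] & 0x3F);
  }
  return { codePoint, length: ones };
}

const result = decodeFirstCodePoint([0xE6, 0x96, 0x87]);
console.log(result.length);                 // 3
console.log(result.codePoint.toString(16)); // "6587"
```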

Juvenile、少年°
#5 · 2019-01-21 14:54

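Why are there so many complicated answers?

Most Chinese characters take 3 bytes in UTF-8 (the rarer ones outside the Basic Multilingual Plane take 4). You can check with this function, which uses jQuery and returns the UTF-8 byte length of a field's value: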

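// Returns the UTF-8 byte length of the value of the field matched by field_selector.
function get_length(field_selector) {
  // encodeURI writes every byte it escapes as "%XX";
  // all characters it leaves alone are single-byte ASCII.
  var escapedStr = encodeURI($(field_selector).val());
  // Number of "%XX" escapes, i.e. the number of escaped bytes.
  var escapedBytes = escapedStr.split("%").length - 1;
  // Each escape is 3 characters long; every other character is 1 byte.
  var plainBytes = escapedStr.length - escapedBytes * 3;
  return escapedBytes + plainBytes;
}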
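As an alternative sketch that does not rely on jQuery or URI escaping, environments that provide the standard `TextEncoder` API (current browsers and Node.js) can encode the string and count the bytes directly. The name `utf8ByteLength` is hypothetical, and the sample string is my own example.

```javascript
// Hypothetical helper: UTF-8 byte length of a JavaScript string.
// TextEncoder always encodes to UTF-8 and returns a Uint8Array.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log(utf8ByteLength("aaa文")); // 6: three 1-byte a's plus one 3-byte character
```

One difference worth knowing: `encodeURI` throws a `URIError` on a lone surrogate, while `TextEncoder` replaces it with U+FFFD, which counts as 3 bytes.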
我想做一个坏孩纸
#6 · 2019-01-21 14:58

The hint is in this sentence here:

In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Every code point up to 127 has the top bit set to zero. Therefore, the editor knows that if it encounters a byte where the top bit is a 1, that byte is part of a multi-byte character.

走好不送
#7 · 2019-01-21 14:59

Code points up to 0x7F are stored as 1 byte; up to 0x7FF as 2 bytes; up to 0xFFFF as 3 bytes; everything else as 4 bytes. (Technically, 4 bytes can reach 0x1FFFFF, but the highest code point allowed in Unicode is 0x10FFFF.)
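Going in the encoding direction, those thresholds can be written as a small function. This is just a sketch; `encodedLength` is a name invented here, and it assumes the argument is a valid Unicode code point (0 to 0x10FFFF).

```javascript
// Hypothetical helper: number of bytes UTF-8 needs for a code point.
// Assumes 0 <= codePoint <= 0x10FFFF.
function encodedLength(codePoint) {
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4;
}

// 文 is U+6587, which lies between 0x800 and 0xFFFF.
console.log(encodedLength("文".codePointAt(0))); // 3
```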

When decoding, the first byte of the multi-byte sequence is used to determine the number of bytes used to make the sequence:

  1. 110x xxxx => 2-byte sequence
  2. 1110 xxxx => 3-byte sequence
  3. 1111 0xxx => 4-byte sequence

All subsequent bytes in the sequence must fit the 10xx xxxx pattern.
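A decoder can check these patterns with bit masks. The sketch below, with the invented name `matchesSequencePattern`, tests only the bit patterns described above for a single character's bytes; a full validator would also have to reject overlong encodings, surrogates, and code points above 0x10FFFF, which this does not attempt.

```javascript
// Hypothetical helper: do `bytes` form exactly one sequence whose bit
// patterns match the first-byte and continuation-byte rules?
// Checks patterns only, not overlong forms or the Unicode range.
function matchesSequencePattern(bytes) {
  const first = bytes[0];
  let length;
  if ((first & 0x80) === 0x00) length = 1;      // 0xxx xxxx
  else if ((first & 0xE0) === 0xC0) length = 2; // 110x xxxx
  else if ((first & 0xF0) === 0xE0) length = 3; // 1110 xxxx
  else if ((first & 0xF8) === 0xF0) length = 4; // 1111 0xxx
  else return false;                            // 10xx xxxx or 1111 1xxx

  if (bytes.length !== length) return false;

  // Every subsequent byte must fit the 10xx xxxx pattern.
  for (let i = 1; i < length; i++) {
    if ((bytes[i] & 0xC0) !== 0x80) return false;
  }
  return true;
}

console.log(matchesSequencePattern([0xE6, 0x96, 0x87])); // true
console.log(matchesSequencePattern([0xE6, 0x61, 0x87])); // false
```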
