The Unicode site says that UTF-8 can be represented by 1-4 bytes. As I understand from this question, https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings, UTF-8 is an 8-bit encoding. So which is true? If it is an 8-bit encoding, then what is the difference between ASCII and UTF-8? If it is not, then why is it called UTF-8, and why do we need UTF-16 and others if they occupy the same amount of memory?
UTF-8 is a variable-width encoding built from 8-bit code units. The first 128 characters of Unicode, when encoded as UTF-8, have the same representation as the corresponding characters in ASCII.
To understand this further, Unicode treats characters as code points: mere numbers that can be represented in multiple ways (the encodings). UTF-8 is one such encoding. It is the most commonly used, in part because it is compact for text that is mostly ASCII (text dominated by other scripts, such as CJK, can be smaller in UTF-16). If you are storing characters from the ASCII character set in UTF-8, the encoded data takes exactly the same amount of space as it would in ASCII. This allowed applications that previously used ASCII to move to Unicode seamlessly (well, not quite, but it certainly didn't result in something like Y2K), because the character representations are the same.
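The ASCII compatibility described above is easy to check. The following is a minimal sketch using Python's built-in codecs; the sample string and variable names are made up for illustration.

```python
# Compare the bytes produced by the ASCII and UTF-8 codecs.
text = "Hello, world!"  # hypothetical sample containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same_bytes = ascii_bytes == utf8_bytes        # identical byte sequences
one_byte_each = len(utf8_bytes) == len(text)  # one byte per character

# A character outside the ASCII range needs more than one byte in UTF-8.
e_acute = "\u00e9".encode("utf-8")  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

print(same_bytes, one_byte_each, len(e_acute))
```

The first two values should be true for any ASCII-only input, while the last shows that the size advantage disappears as soon as non-ASCII characters appear.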
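I'll leave this extract from RFC 3629 here, on how the UTF-8 encoding works:

    Char. number range  |        UTF-8 octet sequence
       (hexadecimal)    |              (binary)
    --------------------+---------------------------------------------
    0000 0000-0000 007F | 0xxxxxxx
    0000 0080-0000 07FF | 110xxxxx 10xxxxxx
    0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx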
You'll notice why the encoding will result in characters occupying anywhere between 1 and 4 bytes (the right-hand column) for different ranges of characters in Unicode (the left-hand column).
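The byte counts per range can be reproduced with a small hand-written encoder. This is only a sketch of the bit layout given in RFC 3629, not how any particular library implements it; `utf8_encode` is a hypothetical helper name, and the sample code points are arbitrary picks, one from each range.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point using the RFC 3629 bit layout (sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One arbitrary sample from each range: A, e-acute, euro sign, an emoji.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
lengths = []
for cp in samples:
    encoded = utf8_encode(cp)
    # Cross-check against Python's built-in UTF-8 codec.
    assert encoded == chr(cp).encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

In real code you would simply call `str.encode("utf-8")`; the manual version is only there to make the bit patterns visible.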
UTF-16, UTF-32, UCS-2, etc. employ different encoding schemes in which code points are represented with 16-bit or 32-bit code units instead of the 8-bit units that UTF-8 uses. Note that UTF-16 is itself variable-width: a code point takes either one or two 16-bit units.
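To see how the code-unit sizes differ in practice, here is a short sketch that counts the units each encoding needs for the same character. `unit_counts` is a hypothetical helper; the `-le` codec variants are used so that Python does not prepend a byte order mark, which would distort the counts.

```python
def unit_counts(ch: str) -> tuple:
    """Return the number of code units for ch in UTF-8, UTF-16 and UTF-32."""
    utf8 = len(ch.encode("utf-8"))            # 8-bit units
    utf16 = len(ch.encode("utf-16-le")) // 2  # 16-bit units
    utf32 = len(ch.encode("utf-32-le")) // 4  # 32-bit units
    return utf8, utf16, utf32


# Arbitrary samples: A, e-acute, euro sign, and an emoji outside the BMP.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", unit_counts(ch))
```

Only the last sample needs two units in UTF-16 (a surrogate pair), and every sample needs exactly one unit in UTF-32, so the encodings do not occupy the same memory for the same text.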
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky - Wednesday, October 08, 2003
Excerpt from above:
The '8-bit' encoding means that the individual bytes of the encoding use 8 bits. In contrast, pure ASCII is a 7-bit encoding, as it only has code points 0-127. Software used to have problems with 8-bit encodings; one of the reasons for the Base-64 and uuencode encodings was to get binary data through email systems that did not handle 8-bit data. However, it has been a decade or more since that was an acceptable limitation: software has had to be 8-bit clean, that is, capable of handling 8-bit encodings.
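The 7-bit versus 8-bit distinction can be observed directly on the bytes. The sketch below uses made-up sample strings and only the standard `base64` module.

```python
import base64

# ASCII never sets the high bit: every byte is below 0x80.
ascii_bytes = "plain text".encode("ascii")
fits_in_7_bits = all(b < 0x80 for b in ascii_bytes)

# In UTF-8, every byte of a multi-byte sequence has the high bit set.
utf8_bytes = "\u00e9\u20ac".encode("utf-8")  # e-acute and euro sign
all_high_bit = all(b >= 0x80 for b in utf8_bytes)

# Base-64 re-expresses arbitrary bytes using 7-bit-safe ASCII characters.
wrapped = base64.b64encode(utf8_bytes)
wrapped_is_7_bit = all(b < 0x80 for b in wrapped)

print(fits_in_7_bits, all_high_bit, wrapped_is_7_bit)
```

This is why a channel that strips or mangles the eighth bit is harmless for ASCII but corrupts UTF-8, unless the data is wrapped in something like Base-64 first.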
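Unicode itself is a 21-bit character set. There are a number of encodings for it, including UTF-8 (one to four 8-bit bytes per code point), UTF-16 (one or two 16-bit units per code point), and UTF-32 (one 32-bit unit per code point).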
So, "UTF-8 can be represented by 1-4 bytes" is probably not the most appropriate way of phrasing it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.