I have heard conflicting opinions from people - according to Wikipedia, see here.
They are the same thing, aren't they? Can someone clarify?
To expand on the answers others have given:
We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.
Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.
Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping, notably ISO-8859-1 (also known as ISO-Latin-1 or latin1) and ISO-8859-15; and yes, there are two different versions of the 8859 ISO standard as well.
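To make the "same number, different character" problem concrete, here is a small Python sketch. It assumes Python's built-in codec names "iso-8859-1" and "iso-8859-15", and the byte 0xA4 is just one example I picked where the two mappings disagree.

```python
# One byte from the upper half (128-255) that plain ASCII never defined.
raw = bytes([0xA4])

# The same byte decodes to different characters under different 8-bit mappings.
as_latin1 = raw.decode("iso-8859-1")    # currency sign, U+00A4
as_latin9 = raw.decode("iso-8859-15")   # euro sign, U+20AC

print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_latin9):04X}")
```

So a file of such bytes is only readable if you already know which mapping was used to write it.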
But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.
There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.
The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. For UTF-8 and UTF-16, the standard then reserves certain bit patterns as markers: they tell the decoder whether a unit represents one character fully or is only one part of a multi-unit sequence. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, four in UTF-32), while other characters take two to four bytes in UTF-8 (the original design allowed up to six, but the current standard caps it at four).
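As a rough check of how many units each UTF encoding actually needs, here is a minimal Python sketch. The helper name unit_counts and the sample characters are mine, chosen only for illustration.

```python
def unit_counts(ch):
    """Return how many 8-, 16- and 32-bit units one character needs."""
    return (
        len(ch.encode("utf-8")),            # 8-bit units
        len(ch.encode("utf-16-le")) // 2,   # 16-bit units
        len(ch.encode("utf-32-le")) // 4,   # 32-bit units
    )

# An ASCII letter, an accented letter, a Han character, an emoji.
for ch in ["A", "\u00e9", "\u6c49", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", unit_counts(ch))
```

With current Unicode, only UTF-8 and UTF-16 ever need more than one unit per code point in this sketch; UTF-32 always comes out as one.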
Multi-byte encodings (I should say multi-unit after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).
Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's your relationship between them.
Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 is often a reasonable space/processing compromise for text that is mostly outside the ASCII range, since most characters of living languages fit in a single 16-bit unit.
The Unicode standard defines fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 became the same encoding, as you're unlikely to have to deal with multi-unit characters in UTF-32.
Hope that fills in some details.
"Unicode" is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.
UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U-0010FFFF, and indeed 4 bytes could cope with up to U-001FFFFF).
When "Unicode" is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their "native" character encoding. This leads to hairy problems if you need to worry about characters which can't be encoded in a single UTF-16 value (they're encoded as "surrogate pairs") - but most developers never worry about this, IME.
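For anyone curious what a surrogate pair looks like, here is a small Python sketch of the arithmetic. The function name to_surrogates is made up for this example, and U+1F600 is just a convenient code point above U+FFFF.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000               # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")
```

The result should match what a UTF-16 encoder produces for the same character, which is why such a character counts as two "chars" on UTF-16 platforms.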
Let me use an example to illustrate this topic:
A Chinese character: 汉
its Unicode value: U+6C49
convert 6C49 to binary: 01101100 01001001
Nothing magical so far, it's very simple. Now, let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We can simply store it as is: '01101100 01001001'. Done!
But wait a minute, is '01101100 01001001' one character or two characters? You knew this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one.
This is where the rules of UTF-8 come in: http://www.fileformat.info/info/unicode/utf8.htm
Binary format of bytes in sequence:

1st Byte   2nd Byte   3rd Byte   4th Byte   Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                    7                     007F hex (127)
110xxxxx   10xxxxxx                         (5+6)=11              07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx              (4+6+6)=16            FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   (3+6+6+6)=21          10FFFF hex (1,114,111)
According to the table above, if we want to store this character using the UTF-8 format, we need to prefix our character with some 'headers'. Our Chinese character is 16 bits long (count the binary value yourself), so we will use the format on row 3, as it provides enough space:
Header   Placeholder   Fill in our Binary   Result
1110     xxxx          0110                 11100110
10       xxxxxx        110001               10110001
10       xxxxxx        001001               10001001
Writing out the result in one line:
11100110 10110001 10001001
This is the UTF-8 (binary) value of the Chinese character! (confirm it yourself: http://www.fileformat.info/info/unicode/char/6c49/index.htm)
A Chinese character: 汉
its Unicode value: U+6C49
convert 6C49 to binary: 01101100 01001001
embed 6C49 as UTF-8: 11100110 10110001 10001001
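The header-and-fill procedure above can be sketched in a few lines of Python. This is a hand-rolled illustration only (the name utf8_encode is mine, and it does not reject surrogate code points the way a real encoder would); in practice you would just call str.encode("utf-8").

```python
def utf8_encode(cp):
    """Encode one code point by hand, one branch per row of the UTF-8 table."""
    if cp < 0x80:                       # row 1: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # row 2: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # row 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # row 4: 11110xxx and three 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")

encoded = utf8_encode(0x6C49)
binary = " ".join(f"{b:08b}" for b in encoded)
print(binary)
```

The printed bits should be the same three bytes worked out by hand above, and they should agree with Python's own encoder.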
They're not the same thing - UTF-8 is a particular way of encoding Unicode.
There are lots of different encodings you can choose from depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know.
Unicode only defines code points, that is, numbers which each represent a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.
Unicode is a standard that defines, along with ISO/IEC 10646, the Universal Character Set (UCS), which is a superset of all existing characters required to represent practically all known languages.
Unicode assigns a Name and a Number (Character Code, or Code-Point) to each character in its repertoire.
The UTF-8 encoding is a way to represent these characters digitally in computer memory. UTF-8 maps each code-point into a sequence of octets (8-bit bytes).
For example:
UCS Character = Unicode Han Character
UCS code-point = U+24B62
UTF-8 encoding = F0 A4 AD A2 (hex) = 11110000 10100100 10101101 10100010 (bin)
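If you want to double-check those four octets, a quick Python sketch using only the built-in str.encode is enough (the variable names are mine):

```python
han = chr(0x24B62)                  # the code point from the example
encoded = han.encode("utf-8")       # map the code point to octets

hex_form = " ".join(f"{b:02X}" for b in encoded)
bin_form = " ".join(f"{b:08b}" for b in encoded)
print(hex_form)
print(bin_form)
```

Note that this is one character (one code point) that occupies four bytes once encoded.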
Unicode is just a standard that defines a character set (UCS) and encodings (UTF) to encode this character set. But in general, "Unicode" is used to refer to the character set and not the standard.
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode In 5 Minutes.
The existing answers already explain a lot of details, but here's a very short answer with the most direct explanation and example.
Unicode is the standard that maps characters to codepoints.
Each character has a unique codepoint (identification number), which is a number like 9731.
UTF-8 is an encoding of the codepoints.
In order to store all characters on disk (in a file), UTF-8 splits characters into up to 4 octets (8-bit sequences) - bytes.
UTF-8 is one of several encodings (methods of representing data). For example, in Unicode, the (decimal) codepoint 9731 represents a snowman (☃), which consists of 3 bytes in UTF-8: E2 98 83
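Going the other way, from the stored bytes back to the codepoint, looks roughly like this in Python (a sketch using only built-ins; the variable names are mine):

```python
snowman_bytes = bytes.fromhex("E2 98 83")   # what is stored on disk
snowman = snowman_bytes.decode("utf-8")     # bytes -> character
codepoint = ord(snowman)                    # character -> codepoint number

print(snowman, codepoint, f"U+{codepoint:04X}")
```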
Here\'s a sorted list with some random examples.
There are lots of characters around the world, like "$, &, h, a, t, ?, 张, 1, =, + ...".
Then along came an organization dedicated to these characters.
They made a standard called "Unicode".
The standard looks like this:
PS: Of course there's another organization called ISO maintaining another standard, "ISO 10646", which is nearly the same.
As above, U+0024 is just a position, so we can't save "U+0024" in a computer for the character "$".
There must be an encoding method.
Then along came encoding methods, such as UTF-8, UTF-16, UTF-32, UCS-2...
Under UTF-8, the code point "U+0024" is encoded as 00100100.
00100100 is the value we save in the computer for "$".
I have checked the links in Gumbo's answer, and I wanted to paste some part of those things here to exist on Stack Overflow as well.
"...Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.
In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.
Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:
A -> 0100 0001
In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole other story..."
"...Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041...."
"...OK, so say we have a string:
Hello
which, in Unicode, corresponds to these five code points:
U+0048 U+0065 U+006C U+006C U+006F.
Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message..."
"...That's where encodings come in.
The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes
00 48 00 65 00 6C 00 6C 00 6F
Right? Not so fast! Couldn\'t it also be:
48 00 65 00 6C 00 6C 00 6F 00 ? ..."
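Both byte orders from the quote can be reproduced with Python's "utf-16-be" and "utf-16-le" codecs. The last step is my own addition, not part of the quoted excerpt: it shows how a byte order mark (here codecs.BOM_UTF16_LE) lets the generic "utf-16" codec tell the two layouts apart.

```python
import codecs

text = "Hello"
big = text.encode("utf-16-be")       # high byte first
little = text.encode("utf-16-le")    # low byte first

def as_hex(data):
    """Format bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

print(as_hex(big))
print(as_hex(little))

# With a byte order mark in front, the decoder can work out the layout itself.
roundtrip = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(roundtrip)
```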
Unicode is a broad-scoped standard which defines over 130,000 characters and allocates each a numerical code (a "codepoint"). It also defines rules for how to sort this text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters.
The codes in Unicode can be represented in more than one encoding. The simplest is UTF-32, which simply encodes the code point as 32-bit integers, with each being 4 bytes wide.
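To show how little UTF-32 actually does, here is a hand-written Python sketch. The helper utf32_be is hypothetical, and I picked big-endian without a byte order mark so it can be compared against Python's "utf-32-be" codec.

```python
def utf32_be(text):
    """Write each code point as one 4-byte big-endian integer."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)

sample = "A\u6c49\U0001F600"     # ASCII letter, Han character, emoji
encoded = utf32_be(sample)
print(encoded.hex())
```

Every code point costs exactly four bytes, which makes indexing trivial but wastes space for mostly-ASCII text.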
UTF-8 is another encoding, and quickly becoming the de-facto standard. It encodes as a sequence of byte values. Each code point can use a variable number of these bytes. Code points in the ASCII range are encoded bare, to be compatible with ASCII. Code points outside this range use a variable number of bytes, either 2, 3, or 4, depending on what range they are in.
UTF-8 has been designed with these properties in mind:
ASCII characters are encoded exactly as they are in ASCII, such that an ASCII string is also valid as UTF-8.
Binary sorting: Sorting UTF-8 strings using a naive binary sort will still result in all code points being sorted in numerical order.
Characters outside the ASCII range do not use any bytes in the ASCII range, ensuring they cannot be mistaken for ASCII characters. This is also a security feature.
UTF-8 can be easily validated, and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8.
Random access: At any point in the UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to backtrack to the start of that character, without needing to refer to anything at the start of the string.
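A few of these properties are easy to poke at in Python. The sketch below is only illustrative: char_start is a helper I made up for the backtracking property, and the sample words are arbitrary.

```python
def char_start(data, i):
    """Walk back from byte index i to the first byte of its character."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # 10xxxxxx marks a continuation byte
        i -= 1
    return i

# ASCII compatibility: pure ASCII text has identical bytes in both encodings.
ascii_same = "hat".encode("ascii") == "hat".encode("utf-8")

# Binary sorting: byte order matches code point order.
words = ["zebra", "\u00c4rger", "\u6c49\u5b57", "apple", "\U0001F600"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_codepoints = sorted(words)        # Python compares str by code point

# Non-ASCII characters never produce bytes in the ASCII range (below 0x80).
no_ascii_bytes = all(b >= 0x80 for b in "\u6c49".encode("utf-8"))

# Random access: index 2 is in the middle of the 3-byte character at index 1.
data = "a\u6c49b".encode("utf-8")    # 61 | E6 B1 89 | 62
print(ascii_same, by_bytes == by_codepoints, no_ascii_bytes, char_start(data, 2))
```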
They are the same thing, aren't they?
No, they aren't.
I think the first sentence of the Wikipedia page you referenced gives a nice, brief summary:
UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
To elaborate:
Unicode is a standard, which defines a map from characters to numbers, the so-called code points, (like in the example below). For the full mapping, you can have a look here.
! -> U+0021 (21),
" -> U+0022 (22),
# -> U+0023 (23)
UTF-8 is one of the ways to encode these code points in a form a computer can understand, aka bits. In other words, it's a way/algorithm to convert each of those code points to a sequence of bits or convert a sequence of bits to the equivalent code points. Note that there are a lot of alternative encodings for Unicode.
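Here is a minimal Python sketch of both steps for the three characters from the mapping above: character to code point (Unicode), then code point to bits and back (UTF-8). The helper name code_points is mine.

```python
def code_points(text):
    """List the Unicode code point of each character in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

text = '!"#'
points = code_points(text)                 # the Unicode step: characters -> numbers

encoded = text.encode("utf-8")             # the UTF-8 step: numbers -> bits
bits = " ".join(f"{b:08b}" for b in encoded)

decoded = encoded.decode("utf-8")          # and back: bits -> characters
print(points, bits, decoded)
```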
Joel gives a really nice explanation and an overview of the history here.
UTF-8 is a method for encoding Unicode characters using 8-bit sequences.
Unicode is a standard for representing a great variety of characters from many languages.