I have heard conflicting opinions from people - according to Wikipedia, see here.
They are the same thing, aren't they? Can someone clarify?
They're not the same thing - UTF-8 is a particular way of encoding Unicode.
There are lots of different encodings you can choose from depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know.
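As a rough illustration (a minimal sketch in Python, whose str type works in code points), the same text encoded with each of these produces a different byte sequence:

    # The same Unicode text, stored three different ways.
    # (bytes.hex(" ") needs Python 3.8+)
    text = "héllo"
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        data = text.encode(encoding)
        print(encoding, len(data), data.hex(" "))

    # utf-8      6 68 c3 a9 6c 6c 6f
    # utf-16-le 10 68 00 e9 00 6c 00 6c 00 6f 00
    # utf-32-le 20 68 00 00 00 e9 00 00 00 6c 00 00 00 6c 00 00 00 6f 00 00 00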
No, they aren't.
I think the first sentence of the Wikipedia page you referenced gives a nice, brief summary:
To elaborate:
Unicode is a standard that defines a map from characters to numbers, the so-called code points (for example, the letter A maps to the code point U+0041). For the full mapping, you can have a look here.
UTF-8 is one of the ways to encode these code points in a form a computer can understand, i.e. bits. In other words, it's an algorithm to convert each of those code points to a sequence of bits, or a sequence of bits back to the equivalent code points. Note that there are many alternative encodings for Unicode.
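To make the split concrete, here is a minimal Python sketch (nothing here is specific to Python, it just makes the two steps visible): Unicode assigns the number, UTF-8 decides the bytes.

    ch = "€"                           # EURO SIGN

    code_point = ord(ch)               # the Unicode side: character -> number
    print(hex(code_point))             # 0x20ac, usually written U+20AC

    utf8_bytes = ch.encode("utf-8")    # the encoding side: code point -> bytes
    print(utf8_bytes.hex(" "))         # e2 82 ac (three bytes in UTF-8)

    print(utf8_bytes.decode("utf-8"))  # € -- and back again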
Joel gives a really nice explanation and an overview of the history here.
Unicode is just a standard that defines a character set (UCS) and encodings (UTF) to encode this character set. But in general, "Unicode" refers to the character set and not the standard.
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode In 5 Minutes.
Unicode only defines code points, that is, a number which represents a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.
UTF-8 is a method for encoding Unicode characters using sequences of 8-bit bytes.
Unicode is a standard for representing a great variety of characters from many languages.
Unicode is a broad-scoped standard which defines over 130,000 characters and allocates each a numerical code (a "codepoint"). It also defines rules for how to sort this text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters.
The code points in Unicode can be represented in more than one encoding. The simplest is UTF-32, which encodes each code point as a 32-bit integer, 4 bytes wide.
UTF-8 is another encoding, and is quickly becoming the de facto standard. It encodes each code point as a sequence of byte values. Code points in the ASCII range are encoded as a single bare byte, to stay compatible with ASCII. Code points outside this range use 2, 3, or 4 bytes, depending on what range they are in.
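A quick way to see the variable width (a sketch in Python; the specific characters are just examples picked from each range):

    samples = ["A", "é", "€", "😀"]   # U+0041, U+00E9, U+20AC, U+1F600

    for ch in samples:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

    # U+0041 -> 1 byte(s): 41
    # U+00E9 -> 2 byte(s): c3 a9
    # U+20AC -> 3 byte(s): e2 82 ac
    # U+1F600 -> 4 byte(s): f0 9f 98 80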
UTF-8 has been designed with these properties in mind:
ASCII characters are encoded exactly as they are in ASCII, such that an ASCII string is also valid as UTF-8.
Binary sorting: Sorting UTF-8 strings using a naive binary sort will still result in all code points being sorted in numerical order.
Characters outside the ASCII range do not use any bytes in the ASCII range, ensuring they cannot be mistaken for ASCII characters. This is also a security feature.
UTF-8 can be easily validated, and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8.
Random access: At any point in the UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to backtrack to the start of that character, without needing to refer to anything at the start of the string.
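A small Python sketch of the sorting and random-access points above (the char_start helper is just for illustration, not a standard function): continuation bytes always look like 0b10xxxxxx, so you can back up to the start of a character from any byte offset, and comparing the raw bytes orders strings by code point.

    def char_start(data, index):
        """Offset of the first byte of the character containing data[index]."""
        while index > 0 and (data[index] & 0b1100_0000) == 0b1000_0000:
            index -= 1        # 0b10xxxxxx is a continuation byte: keep backing up
        return index

    data = "a€b".encode("utf-8")   # 61 e2 82 ac 62
    print(char_start(data, 2))     # 1 -> offset of 0xe2, the first byte of "€"

    # Binary sorting: comparing raw UTF-8 bytes orders strings by code point.
    words = ["z", "é", "a", "€"]
    assert sorted(words, key=lambda w: w.encode("utf-8")) == sorted(words)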