I have heard conflicting opinions from people - according to Wikipedia, see here.
They are the same thing, aren't they? Can someone clarify?
They're not the same thing - UTF-8 is a particular way of encoding Unicode.
There are lots of different encodings you can choose from depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know.
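As a rough illustration (a minimal sketch in Python, whose str type works in code points), the same text encoded with each of these produces a different byte sequence:

    # The same Unicode text, stored three different ways.
    # (bytes.hex(" ") needs Python 3.8+)
    text = "héllo"
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        data = text.encode(encoding)
        print(encoding, len(data), data.hex(" "))

    # utf-8      6 68 c3 a9 6c 6c 6f
    # utf-16-le 10 68 00 e9 00 6c 00 6c 00 6f 00
    # utf-32-le 20 68 00 00 00 e9 00 00 00 6c 00 00 00 6c 00 00 00 6f 00 00 00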
No, they aren't.
I think the first sentence of the Wikipedia page you referenced gives a nice, brief summary:
To elaborate:
Unicode is a standard that defines a map from characters to numbers, the so-called code points (for example, the letter A maps to the code point U+0041). For the full mapping, you can have a look here.
UTF-8 is one of the ways to encode these code points in a form a computer can understand, i.e. bits. In other words, it's an algorithm to convert each of those code points to a sequence of bits, or a sequence of bits back to the equivalent code points. Note that there are many alternative encodings for Unicode.
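To make the split concrete, here is a minimal Python sketch (nothing here is specific to Python, it just makes the two steps visible): Unicode assigns the number, UTF-8 decides the bytes.

    ch = "€"                           # EURO SIGN

    code_point = ord(ch)               # the Unicode side: character -> number
    print(hex(code_point))             # 0x20ac, usually written U+20AC

    utf8_bytes = ch.encode("utf-8")    # the encoding side: code point -> bytes
    print(utf8_bytes.hex(" "))         # e2 82 ac (three bytes in UTF-8)

    print(utf8_bytes.decode("utf-8"))  # € -- and back again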
Joel gives a really nice explanation and an overview of the history here.
Unicode is just a standard that defines a character set (UCS) and encodings (UTF) to encode this character set. But in general, "Unicode" refers to the character set and not the standard.
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode In 5 Minutes.
Unicode only defines code points, that is, a number which represents a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.
UTF-8 is a method for encoding Unicode characters using sequences of 8-bit bytes.
Unicode is a standard for representing a great variety of characters from many languages.
Unicode is a broad-scoped standard which defines over 130,000 characters and allocates each a numerical code (a "codepoint"). It also defines rules for how to sort this text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters.
The code points in Unicode can be represented in more than one encoding. The simplest is UTF-32, which encodes each code point as a 32-bit integer, 4 bytes wide.
UTF-8 is another encoding, and is quickly becoming the de facto standard. It encodes each code point as a sequence of byte values. Code points in the ASCII range are encoded as a single bare byte, to stay compatible with ASCII. Code points outside this range use 2, 3, or 4 bytes, depending on what range they are in.
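A quick way to see the variable width (a sketch in Python; the specific characters are just examples picked from each range):

    samples = ["A", "é", "€", "😀"]   # U+0041, U+00E9, U+20AC, U+1F600

    for ch in samples:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

    # U+0041 -> 1 byte(s): 41
    # U+00E9 -> 2 byte(s): c3 a9
    # U+20AC -> 3 byte(s): e2 82 ac
    # U+1F600 -> 4 byte(s): f0 9f 98 80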
UTF-8 has been designed with these properties in mind:
ASCII characters are encoded exactly as they are in ASCII, such that an ASCII string is also valid as UTF-8.
Binary sorting: Sorting UTF-8 strings using a naive binary sort will still result in all code points being sorted in numerical order.
Characters outside the ASCII range do not use any bytes in the ASCII range, ensuring they cannot be mistaken for ASCII characters. This is also a security feature.
UTF-8 can be easily validated, and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8.
Random access: At any point in the UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to backtrack to the start of that character, without needing to refer to anything at the start of the string.
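A small Python sketch of the sorting and random-access points above (the char_start helper is just for illustration, not a standard function): continuation bytes always look like 0b10xxxxxx, so you can back up to the start of a character from any byte offset, and comparing the raw bytes orders strings by code point.

    def char_start(data, index):
        """Offset of the first byte of the character containing data[index]."""
        while index > 0 and (data[index] & 0b1100_0000) == 0b1000_0000:
            index -= 1        # 0b10xxxxxx is a continuation byte: keep backing up
        return index

    data = "a€b".encode("utf-8")   # 61 e2 82 ac 62
    print(char_start(data, 2))     # 1 -> offset of 0xe2, the first byte of "€"

    # Binary sorting: comparing raw UTF-8 bytes orders strings by code point.
    words = ["z", "é", "a", "€"]
    assert sorted(words, key=lambda w: w.encode("utf-8")) == sorted(words)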