UTF-8: General? Bin? Unicode?

I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.

My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear a distinction between UTF-8 General CI and UTF-8 Unicode CI.

Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
What type of data would UTF-8 Binary be applicable to?

标签： mysql utf-8 collation

5条回答

余欢

2楼-- · 2019-01-01 10:05

Really, I tested saving values like 'é' and 'e' in column with unique index and they cause duplicate error on both 'utf8_unicode_ci' and 'utf8_general_ci'. You can save them only in 'utf8_bin' collated column.

And mysql docs (in http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) suggest into its examples set 'utf8_general_ci' collation.

[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci

0人赞添加讨论(0) 举报

妖精总统

3楼-- · 2019-01-01 10:08

Accepted answer is outdated.

If you use MySQL 5.5.3+, use utf8mb4_unicode_ci instead of utf8_unicode_ci to ensure the characters typed by your users won't give you errors.

utf8mb4 supports emojis for example, whereas utf8 might give you hundreds of encoding-related bugs like:

Incorrect string value: ‘\xF0\x9F\x98\x81…’ for column ‘data’ at row 1

0人赞添加讨论(0) 举报

呛了眼睛熬了心

4楼-- · 2019-01-01 10:21

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.

Here is the difference:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

For more detailed explanation, please read the following post from MySQL forums: http://forums.mysql.com/read.php?103,187048,188748

As for utf8_bin: Both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In constrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.

0人赞添加讨论(0) 举报

皆成旧梦

5楼-- · 2019-01-01 10:26

utf8_bin compares the bits blindly. No case folding, no accent stripping.
utf8_general_ci compares one byte with one byte. It does case folding and accent stripping, but no 2-character comparisions: ij is not equal ĳ in this collation.
utf8_*_ci is a set of language-specific rules, but otherwise like unicode_ci. Some special cases: Ç, Č, ch, ll
utf8_unicode_ci follows an old Unicode standard for comparisons. ij=ĳ, but ae != æ
utf8_unicode_520_ci follows an newer Unicode standard. ae = æ

See collation chart for details on what is equal to what in various utf8 collations.

utf8, as defined by MySQL is limited to the 1- to 3-byte utf8 codes. This leaves out Emoji and some of Chinese. So you should really switch to utf8mb4 if you want to go much beyond Europe.

The above points apply to utf8mb4, after suitable spelling change. Going forward, utf8mb4 and utf8mb4_unicode_520_ci are preferred.

utf16 and utf32 are variants on utf8; there is virtually no use for them.
ucs2 is closer to "Unicode" than "utf8"; there is virtually no use for it.

0人赞添加讨论(0) 举报

十年一品温如言

6楼-- · 2019-01-01 10:27

You should also be aware of the fact, that with utf8_general_ci when using a varchar field as unique or primary index inserting 2 values like 'a' and 'á' would give a duplicate key error.

0人赞添加讨论(0) 举报

UTF-8: General? Bin? Unicode?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间