I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.
My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear a distinction between UTF-8 General CI and UTF-8 Unicode CI.
- Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
- What type of data would UTF-8 Binary be applicable to?
Really, I tested saving values like 'é' and 'e' in column with unique index and they cause duplicate error on both 'utf8_unicode_ci' and 'utf8_general_ci'. You can save them only in 'utf8_bin' collated column.
And mysql docs (in http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) suggest into its examples set 'utf8_general_ci' collation.
Accepted answer is outdated.
If you use MySQL 5.5.3+, use
utf8mb4_unicode_ci
instead ofutf8_unicode_ci
to ensure the characters typed by your users won't give you errors.utf8mb4
supports emojis for example, whereasutf8
might give you hundreds of encoding-related bugs like:Incorrect string value: ‘\xF0\x9F\x98\x81…’ for column ‘data’ at row 1
In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.
Here is the difference:
Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
For more detailed explanation, please read the following post from MySQL forums: http://forums.mysql.com/read.php?103,187048,188748
As for utf8_bin: Both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In constrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.
utf8_bin
compares the bits blindly. No case folding, no accent stripping.utf8_general_ci
compares one byte with one byte. It does case folding and accent stripping, but no 2-character comparisions:ij
is not equalij
in this collation.utf8_*_ci
is a set of language-specific rules, but otherwise likeunicode_ci
. Some special cases:Ç
,Č
,ch
,ll
utf8_unicode_ci
follows an old Unicode standard for comparisons.ij
=ij
, butae
!=æ
utf8_unicode_520_ci
follows an newer Unicode standard.ae
=æ
See collation chart for details on what is equal to what in various utf8 collations.
utf8
, as defined by MySQL is limited to the 1- to 3-byte utf8 codes. This leaves out Emoji and some of Chinese. So you should really switch toutf8mb4
if you want to go much beyond Europe.The above points apply to
utf8mb4
, after suitable spelling change. Going forward,utf8mb4
andutf8mb4_unicode_520_ci
are preferred.You should also be aware of the fact, that with utf8_general_ci when using a varchar field as unique or primary index inserting 2 values like 'a' and 'á' would give a duplicate key error.