I can read the MySQL documentation and it's pretty clear. But, how does one decide which character set to use? On what data does collation have an effect?
I'm asking for an explanation of the two and how to choose them.
I can read the MySQL documentation and it's pretty clear. But, how does one decide which character set to use? On what data does collation have an effect?
I'm asking for an explanation of the two and how to choose them.
A character set is a subset of all written glyphs. A character encoding specifies how those characters are mapped to numeric values. Some character encodings, like UTF-8 and UTF-16, can encode any character in the Universal Character Set. Others, like US-ASCII or ISO-8859-1 can only encode a small subset, since they use 7 and 8 bits per character, respectively. Because many standards specify both a character set and a character encoding, the term "character set" is often substituted freely for "character encoding".
A collation comprises rules that specify how characters can be compared for sorting. Collations rules can be locale-specific: the proper order of two characters varies from language to language.
Choosing a character set and collation comes down to whether your application is internationalized or not. If not, what locale are you targeting?
In order to choose what character set you want to support, you have to consider your application. If you are storing user-supplied input, it might be hard to foresee all the locales in which your software will eventually be used. To support them all, it might be best to support the UCS (Unicode) from the start. However, there is a cost to this; many western European characters will now require two bytes of storage per character instead of one.
Choosing the right collation can help performance if your database uses the collation to create an index, and later uses that index to provide sorted results. However, since collation rules are often locale-specific, that index will be worthless if you need to sort results according to the rules of another locale.
A character encoding is a way to encode characters so that they fit in memory. That is, if the charset is ISO-8859-15, the euro symbol, €, will be encoded as 0xa4, and in UTF-8, it will be 0xe282ac.
The collation is how to compare characters, in latin9, there are letters as
e é è ê f
, if sorted by their binary representation, it will goe f é ê è
but if the collation is set to, for example, french, you'll have them in the order you thought they would be, which is all ofe é è ê
are equal, and thenf
.I suggest to use
utf8mb4_unicode_ci
, which is based on the Unicode standard for sorting and comparison, which sorts accurately in a very wide range of languages.From MySQL docs: