Well hello there.
I'm changing the charset of a database from latin1_sweedish_ci
to utf8
. I've allways used utf8_danish_ci
because it's the closest to the Norwegian character style - I think.
But what about utf8_general_ci
and utf8_unicode_ci
?
Some time ago; It was preferred to use _general_ci
for better/faster performance, and _unicode_ci
for more accuracy because of the sorting algorithm is more complex in the latter. But since the speed/performace is no longer an issue - or not so much an issue i most cases anymore - _unicode_ci
is ok to use in most situations?
But how does _unicode_ci
differ from _danish_ci
?
Is it the last three letters æ
, ø
, å
in the nordic alphabet that is taken into account?
Most comparisson (one vs the other) I can find is only between _general_ci
and _unicode_ci
.
Anyone know of any examples for when to use _unicode_ci
or when to use _danish_ci
would be highly appreciated...
Please bear in mind that Collation != Encoding.
Encoding is the mapping between integers (which all that a database can store at the end of the day) and human readable graphic representations of characters.
Collation is the ordering rule used to sort characters according to the conventional alphabetical order of a given language. Notice that this ordering does not reflect the actual order of the internal, numeric representation.
Your question reduces to the following: what kind of alphabetical order should you use in your application? This cannot be answered.
I'm not 100% sure but I believe the
utf8_danish_ci
is a subset of (either)utf8
(collation).That being said, and if your database is
utf8
encoded, it makes little sense to use the danish collation.Quick test (since I'm in a hurry and I can't find a collation list for
utf8_unicode
):utf8_danish_ci
utf8_general_ci
or preferablyutf8_unicode_ci
Related link.
UPDATE
My hypothesis was wrong.
I did some tests and apparently
utf8_unicode_ci
does not sort in the same order, so nevermind.In short, if your app is multilingual and storing multiple languages within the same tables, you're mostly screwed and should worry about sorting/collating outside of the database -- utf8_general_ci is then as good as any other.
If it only supports a single language, you'll do fine by setting the correct collation at the db level -- in your case, utf8_danish_ci indeed, since it's the same as norwegian if wikipedia is anything to go by.
In case you'd like to read more on collation, the ICU docs has colorful examples of how thorny this kind of stuff gets. Quoting extensively:
http://userguide.icu-project.org/collation
Collation determines both which characters can be stored in a table, as well as the order of characters. Choosing anything starting with utf8 should cover most needs for character storage so utf8_general_ci is a good choice. If you're going to focus on one language, you can choose the local collation, such as utf8_danish_ci which means the order will be be Danish characters and case insensitive (the ci part).
For a multilingual application, you can store fields with utf8_general_ci and when you're in need of a specific sort or comparison to match your preferred language, add the COLLATE word to your query with your preferred collation from https://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html
The answer provided by @Denis above, claiming you cannot sort in MySQL, is wrong in my experience.