Well hello there.
I'm changing the charset of a database from latin1_sweedish_ci
to utf8
. I've allways used utf8_danish_ci
because it's the closest to the Norwegian character style - I think.
But what about utf8_general_ci
and utf8_unicode_ci
?
Some time ago; It was preferred to use _general_ci
for better/faster performance, and _unicode_ci
for more accuracy because of the sorting algorithm is more complex in the latter. But since the speed/performace is no longer an issue - or not so much an issue i most cases anymore - _unicode_ci
is ok to use in most situations?
But how does _unicode_ci
differ from _danish_ci
?
Is it the last three letters æ
, ø
, å
in the nordic alphabet that is taken into account?
Most comparisson (one vs the other) I can find is only between _general_ci
and _unicode_ci
.
Anyone know of any examples for when to use _unicode_ci
or when to use _danish_ci
would be highly appreciated...
In short, if your app is multilingual and storing multiple languages within the same tables, you're mostly screwed and should worry about sorting/collating outside of the database -- utf8_general_ci is then as good as any other.
If it only supports a single language, you'll do fine by setting the correct collation at the db level -- in your case, utf8_danish_ci indeed, since it's the same as norwegian if wikipedia is anything to go by.
In case you'd like to read more on collation, the ICU docs has colorful examples of how thorny this kind of stuff gets. Quoting extensively:
http://userguide.icu-project.org/collation
[H]ere are some of the ways languages vary in ordering strings:
The letters A-Z can be sorted in a different order than in English.
For example, in Lithuanian, "y" is sorted between "i" and "k".
Combinations of letters can be treated as if they were one letter. For
example, in traditional Spanish "ch" is treated as a single letter,
and sorted between "c" and "d".
Accented letters can be treated as minor variants of the unaccented
letter. For example, "é" can be treated equivalent to "e".
Accented letters can be treated as distinct letters. For example, "Å"
in Danish is treated as a separate letter that sorts just after "Z".
Unaccented letters that are considered distinct in one language can be
indistinct in another. For example, the letters "v" and "w" are two
different letters according to English. However, "v" and "w" are
considered variant forms of the same letter in Swedish.
A letter can be treated as if it were two letters. For example, in
traditional German "ä" is compared as if it were "ae".
Thai requires that the order of certain letters be reversed.
French requires that letters sorted with accents at the end of the
string be sorted ahead of accents in the beginning of the string. For
example, the word "côte" sorts before "coté" because the acute accent
on the final "e" is more significant than the circumflex on the "o".
Sometimes lowercase letters sort before uppercase letters. The reverse
is required in other situations. For example, lowercase letters are
usually sorted before uppercase letters in English. Latvian letters
are the exact opposite.
Even in the same language, different applications might require
different sorting orders. For example, in German dictionaries, "öf"
would come before "of". In phone books the situation is the exact
opposite.
Sorting orders can change over time due to government regulations or
new characters/scripts in Unicode.
Please bear in mind that Collation != Encoding.
Encoding is the mapping between integers (which all that a database can store at the end of the day) and human readable graphic representations of characters.
Collation is the ordering rule used to sort characters according to the conventional alphabetical order of a given language. Notice that this ordering does not reflect the actual order of the internal, numeric representation.
Your question reduces to the following: what kind of alphabetical order should you use in your application? This cannot be answered.
I'm not 100% sure but I believe the utf8_danish_ci
is a subset of (either) utf8
(collation).
That being said, and if your database is utf8
encoded, it makes little sense to use the danish collation.
Quick test (since I'm in a hurry and I can't find a collation list for utf8_unicode
):
- create a table with all these chars (both lower and uppercase) with collation
utf8_danish_ci
- select all records, ordered by char ASC
- change the collation of the table to
utf8_general_ci
or preferably utf8_unicode_ci
- if the characters have the same order in both queries, chose either, doesn't make a difference
Related link.
UPDATE
My hypothesis was wrong.
I did some tests and apparently utf8_unicode_ci
does not sort in the same order, so nevermind.
Collation determines both which characters can be stored in a table, as well as the order of characters. Choosing anything starting with utf8 should cover most needs for character storage so utf8_general_ci is a good choice. If you're going to focus on one language, you can choose the local collation, such as utf8_danish_ci which means the order will be be Danish characters and case insensitive (the ci part).
For a multilingual application, you can store fields with utf8_general_ci and when you're in need of a specific sort or comparison to match your preferred language, add the COLLATE word to your query with your preferred collation from https://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html
The answer provided by @Denis above, claiming you cannot sort in MySQL, is wrong in my experience.