utf8 collation difference between unicode and dani

2019-06-25 23:53发布

问题:

Well hello there. I'm changing the charset of a database from latin1_sweedish_ci to utf8. I've allways used utf8_danish_ci because it's the closest to the Norwegian character style - I think.
But what about utf8_general_ci and utf8_unicode_ci?

Some time ago; It was preferred to use _general_ci for better/faster performance, and _unicode_ci for more accuracy because of the sorting algorithm is more complex in the latter. But since the speed/performace is no longer an issue - or not so much an issue i most cases anymore - _unicode_ci is ok to use in most situations?

But how does _unicode_ci differ from _danish_ci?
Is it the last three letters æ, ø, å in the nordic alphabet that is taken into account?

Most comparisson (one vs the other) I can find is only between _general_ci and _unicode_ci.

Anyone know of any examples for when to use _unicode_ci or when to use _danish_ci would be highly appreciated...

回答1:

In short, if your app is multilingual and storing multiple languages within the same tables, you're mostly screwed and should worry about sorting/collating outside of the database -- utf8_general_ci is then as good as any other.

If it only supports a single language, you'll do fine by setting the correct collation at the db level -- in your case, utf8_danish_ci indeed, since it's the same as norwegian if wikipedia is anything to go by.

In case you'd like to read more on collation, the ICU docs has colorful examples of how thorny this kind of stuff gets. Quoting extensively:

http://userguide.icu-project.org/collation

[H]ere are some of the ways languages vary in ordering strings:

The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".

Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".

Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated equivalent to "e".

Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts just after "Z".

Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.

A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".

Thai requires that the order of certain letters be reversed.

French requires that letters sorted with accents at the end of the string be sorted ahead of accents in the beginning of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".

Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are usually sorted before uppercase letters in English. Latvian letters are the exact opposite.

Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.

Sorting orders can change over time due to government regulations or new characters/scripts in Unicode.



回答2:

Please bear in mind that Collation != Encoding.

Encoding is the mapping between integers (which all that a database can store at the end of the day) and human readable graphic representations of characters.

Collation is the ordering rule used to sort characters according to the conventional alphabetical order of a given language. Notice that this ordering does not reflect the actual order of the internal, numeric representation.

Your question reduces to the following: what kind of alphabetical order should you use in your application? This cannot be answered.



回答3:

I'm not 100% sure but I believe the utf8_danish_ci is a subset of (either) utf8 (collation).

That being said, and if your database is utf8 encoded, it makes little sense to use the danish collation.

Quick test (since I'm in a hurry and I can't find a collation list for utf8_unicode):

  • create a table with all these chars (both lower and uppercase) with collation utf8_danish_ci
  • select all records, ordered by char ASC
  • change the collation of the table to utf8_general_ci or preferably utf8_unicode_ci
  • if the characters have the same order in both queries, chose either, doesn't make a difference

Related link.


UPDATE

My hypothesis was wrong.

I did some tests and apparently utf8_unicode_ci does not sort in the same order, so nevermind.



回答4:

Collation determines both which characters can be stored in a table, as well as the order of characters. Choosing anything starting with utf8 should cover most needs for character storage so utf8_general_ci is a good choice. If you're going to focus on one language, you can choose the local collation, such as utf8_danish_ci which means the order will be be Danish characters and case insensitive (the ci part).

For a multilingual application, you can store fields with utf8_general_ci and when you're in need of a specific sort or comparison to match your preferred language, add the COLLATE word to your query with your preferred collation from https://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html

The answer provided by @Denis above, claiming you cannot sort in MySQL, is wrong in my experience.