Choosing table collation for universal characters

Posted 2019-07-22 14:22

Question:

I'm working on a backend that needs to store universal characters.

I've chosen utf8mb4 Table Encoding for that purpose. I also have to choose Table Collation.

The most straightforward option is to choose the utf8mb4_general_ci table collation. Besides the general one, there are also about 20 other collations to choose from. What is the purpose of the more specific ones? Does utf8mb4_general_ci, or maybe utf8mb4_unicode_520_ci, cover all of them? Which one should I use if I want to store characters ranging from Chinese all the way to Arabic?

Answer 1:

  • ...general_ci is simple. It does not equate 2-character combinations (such as with a non-spacing mark) with the single-character equivalent.

  • ...unicode_520_ci comes from Unicode version 5.2.0, the latest version available when MySQL picked up on it. It handles things like having an ordering for Emoji, which previous versions did not have.

  • With MySQL 8.0, the preferred collation is utf8mb4_0900_ai_ci, based on Unicode 9.0.

  • ...<language>_ci handles variations found in the given language. For example, should ch and ll in Spanish be treated as separate "letters" that sort between cz and d, and between lz and m, respectively?

  • For general use, do not use ...general_ci, use the latest version derived from Unicode. For language-specific situations, pick one of the other collations.

  • I do not know how (or even whether) Chinese and Arabic are sorted differently in the different collations. However, I see ...persian_ci, so I suspect there is an issue.

  • Do use utf8mb4, not utf8, especially since you need Chinese.
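
Two of the points above can be checked outside of MySQL. The following is a minimal Python sketch, not MySQL's own collation logic: it shows why a precomposed character and its 2-character combination are different strings unless something equates them, and why a 3-byte-per-character encoding such as MySQL's utf8 cannot hold every Chinese character or Emoji. The helper names (utf8_len, fits_in_3_bytes, canonically_equal) are made up for this illustration.

```python
import unicodedata


def utf8_len(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_3_bytes(text: str) -> bool:
    """True if every character needs at most 3 UTF-8 bytes.

    MySQL's utf8 (utf8mb3) stores at most 3 bytes per character;
    utf8mb4 allows up to 4.
    """
    return all(utf8_len(ch) <= 3 for ch in text)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode NFC normalization.

    This only illustrates the idea of equating a base letter plus a
    non-spacing mark with the precomposed character.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"   # "é" as a single code point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

print(len(precomposed), len(combining))            # 1 2
print(precomposed == combining)                    # False
print(canonically_equal(precomposed, combining))   # True

print(utf8_len("\u4e2d"))        # 3: a common Chinese character (BMP)
print(utf8_len("\U00020000"))    # 4: a CJK Extension B character
print(utf8_len("\U0001F600"))    # 4: an Emoji
```

Any text for which fits_in_3_bytes returns False needs utf8mb4 rather than utf8 on the MySQL side.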