MySQL does not treat ı as i?

2020-05-03 10:39发布

问题:

I have a user table in MySQL 5.7.27 with utf8mb4_unicode_ci collation.

Unfortunately, ı is not threaded as i for example, the below query won't find Yılmaz

select id from users where name='Yilmaz';

I do not have the problem with other umlautes like ä and a. For example the two queries give the exact same result.

select id from users where name='Märie';

select id from users where name='Marie';

I cannot simply replace ı by i and to the search, because then I would not find users with the name Yılmaz.

Do I have to use different collation to support all umlaute?

Here are some more information about the unicode letters:

code    | glyph |decimal |  html   | description
U+0131  |  ı    |305     |ı |  Latin Small Letter dotless I
U+0069  |  i    |105     |-        |  Latin Small Letter I

回答1:

Referring to http://mysql.rjweb.org/utf8_collations.html , I see that ı=i in 3 collations: utf8_general_ci, utf8_general_mysql500_ci, utf8_turkish_ci. However, for the turkish collation, I=ı sorts before other accented I's. In all other collations ı sorts after all I's, as if it is treated as a separate letter.

Meanwhile İ=I in all collations except utf8_turkish_ci.

The plot thickens with MySQL 8.0. utf8mb4_tr_0900_ai_ci (only) has this ordering:

I=Ì=Í=Î=Ï=Ĩ=Ī=Ĭ=Į=ı sort before  i=ì=í=î=ï=ĩ=ī=ĭ=į=İ

Meanwhile ä=Ä and they match most other accented A's for most collations (including the Turkish ones).

Bottom line: It seems that utf8[mb4]_general_ci is the only collation in 5.7 or 8.0 that will always treat a dotless-i (or dotted-I) equal to a 'regular i/I and at the same time ignore umlauts.

Caveat: The "general" collations do not test more than one character at a time. That is, a "non-spacing umlaut" plus a vowel will not be treated as equal to the combination.

In that link... The one character æ is sorted the same as the two letters ae for some collations. That's indicated by: Aa ae=æ az. In about half of the other collations, the character æ is treated as a separate letter; this is indicated by it being after az and before b. Or even after zz for Scandinavian collations. This separate letter concept sometimes applies to letter pairs, for example cs (Hungarian) and ch (traditional Spanish).