Question:

I have a MySQL table with two columns, both utf8_unicode_ci collated. It contains the following rows. Besides ASCII, the second field also contains Unicode codepoints like U+02C8 (MODIFIER LETTER VERTICAL LINE) and U+02D0 (MODIFIER LETTER TRIANGULAR COLON).

 word   | ipa
--------+----------
 Hallo  | haˈloː
 IPA    | ˌiːpeːˈʔaː

I need to search the second field with LIKE and REGEXP, but MySQL (5.0.77) seems to interpret these fields as bytes, not as characters.

SELECT * FROM pronunciation WHERE ipa LIKE '%ha?lo%';  -- 0 rows
SELECT * FROM pronunciation WHERE ipa LIKE '%ha??lo%'; -- 1 row

SELECT * FROM pronunciation WHERE ipa REGEXP 'ha.lo';  -- 0 rows
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha..lo'; -- 1 row

I'm quite sure that the data is stored correctly, as it seems good when I retrieve it and shows up fine in phpMyAdmin. I'm on a shared host, so I can't really install programs.

How can I solve this problem? If it's not possible: is there a plausible work-around that does not involve processing the entire database with PHP every time? There are 40 000 lines, and I'm not dead-set on using MySQL (or UTF8, for that matter). I only have access to PHP and MySQL on the host.

Edit: There is an open 4-year-old MySQL bug report, Bug #30241 Regular expression problems, which notes that the regexp engine works byte-wise. Thus, I'm looking for a work-around.

Answer 1:

EDITED to incorporate a fix in response to valid criticism

Use the HEX() function to render your bytes to hexadecimal and then use RLIKE on that, for example:

select * from mytable
where hex(ipa) rlike concat('^(..)*', hex('needle')); -- looks for 'needle' in the haystack; the anchored '^(..)*' prefix keeps the match aligned to hex pairs (whole bytes)

The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.

This works for "normal" columns too, you just don't need it.

p.s. @Kieren's (valid) point addressed using rlike to enforce char pairs
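
To see why the hex-pair alignment matters, here is a small Python sketch that emulates the idea outside MySQL. It is only an illustration: mysql_hex and hex_contains are hypothetical helper names, and Python's re module stands in for RLIKE. The point it tries to show is that the pair prefix only enforces alignment when the pattern is anchored at the start of the string; an unanchored search can match across two neighbouring bytes.

```python
import re


def mysql_hex(text):
    """Rough stand-in for MySQL HEX() on a utf8 column: uppercase hex of the UTF-8 bytes."""
    return text.encode("utf-8").hex().upper()


def hex_contains(haystack, needle):
    """Search for needle in haystack via hex strings, aligned to whole bytes."""
    # '^(..)*' consumes complete hex pairs only, so the needle must start on a byte boundary.
    pattern = "^(..)*" + mysql_hex(needle)
    return re.search(pattern, mysql_hex(haystack)) is not None


# Needle containing U+02C8, found inside a haystack with U+02C8 and U+02D0.
found = hex_contains("ha\u02c8lo\u02d0", "\u02c8lo")

# '23' hex-encodes to 3233, which contains the hex of '#' (23) at an odd offset.
unaligned_hit = re.search(mysql_hex("#"), mysql_hex("23")) is not None
aligned_hit = hex_contains("23", "#")
```

Here found is true, unaligned_hit is a false positive produced by the unanchored search, and aligned_hit is false because the anchored pattern rejects a match that starts in the middle of a byte.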

Answer 2:

"I'm not dead-set on using MySQL"

Postgres seems to handle it quite fine:

test=# select 'ˌˈʔ' like '___';
 ?column? 
----------
 t
(1 row)

test=# select 'ˌˈʔ' ~ '^.{3}$';
 ?column? 
----------
 t
(1 row)

If you go down that road, note that Postgres' ilike operator is the one that behaves like MySQL's like. (In Postgres, like is case-sensitive.)

For a MySQL-specific solution, you might be able to work around it by binding a user-defined function (maybe one built on the ICU library?) into MySQL.

Answer 3:

You have problems with UTF8? Eliminate them.

How many special characters do you use? You're using only lowercase letters, am I right? So my tip is: write a function that converts special characters to regular characters, e.g. "æ" -> "A" and so on, and add a column to the table that stores the converted value (you have to convert all existing values first, and then again on each insert/update). When searching, convert the search string with the same function and use it on that column with REGEXP.

If there are too many kinds of special characters, convert each one to a multi-character token instead. Two things to watch out for:
1. To avoid finding "aa" in the sequence "ba ab", start every token with a prefix, like "@ba@ab".
2. To avoid finding "@a" in "@ab", use fixed-length tokens, say of length 2.
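
A minimal Python sketch of the fixed-length token idea follows. It makes some assumptions that are not in the answer: instead of a hand-written mapping table and a prefix, every character becomes its code point as four hex digits (so the token width is 4 rather than 2), all characters are assumed to be in the Basic Multilingual Plane, and the only wildcard handled is '.'. The names encode_tokens and encode_pattern are hypothetical. On the shared host the same logic would have to be written in PHP, with encode_tokens filling the extra column and encode_pattern producing the REGEXP argument.

```python
import re

TOKEN_WIDTH = 4  # assumption: every character is in the BMP (U+0000..U+FFFF)


def encode_tokens(text):
    """Turn each character into a fixed-width ASCII token (its code point in hex)."""
    tokens = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("character outside the BMP: %r" % ch)
        tokens.append("%04X" % ord(ch))
    return "".join(tokens)


def encode_pattern(pattern):
    """Translate a simple pattern: '.' means one character, anything else is literal."""
    parts = []
    for ch in pattern:
        parts.append("." * TOKEN_WIDTH if ch == "." else encode_tokens(ch))
    # Anchor on token boundaries so a match cannot start in the middle of a token.
    return "^(" + "." * TOKEN_WIDTH + ")*" + "".join(parts)


stored = encode_tokens("ha\u02c8lo\u02d0")  # value for the extra, search-only column
one_wildcard = re.search(encode_pattern("ha.lo"), stored) is not None
two_wildcards = re.search(encode_pattern("ha..lo"), stored) is not None
```

With the sample row from the question, one_wildcard is true and two_wildcards is false, which is the character-wise behaviour the question asks for. Because every token has the same width and the pattern is anchored on token boundaries, both pitfalls listed above are avoided without a separate prefix character.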