I'm running MySQL 5.1.50 and have a table that looks like this:
organizations | CREATE TABLE `organizations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`url` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`phone` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=25837 DEFAULT CHARSET=utf8 |
The problem I'm having is that MySQL is matching unicode characters with ascii versions. For example when I search for a word with that contains an 'é', it will match the same word that has an 'e' instead, and vice versa:
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT id, name FROM `organizations` WHERE `name` = 'Universite de Montreal';
+-------+-------------------------+
| id | name |
+-------+-------------------------+
| 16973 | Université de Montreal |
+-------+-------------------------+
1 row in set (0.01 sec)
I get these results both from PHP and the command line console. How can I get accurate matches from my SELECT queries?
Thanks!
You specified the name
column as text CHARACTER SET utf8 COLLATE utf8_unicode_ci
which tells MySQL to consider e and é as equivalent in matching and sorting. That collation and utf8_general_ci
both make a lot of things equivalent.
http://www.collation-charts.org/ is a great resource once you learn how to read the charts, which is pretty easy.
If you want e and é etc. to be considered different then you must choose a different collation. To find out what collations are on your server (assuming you're limited to UTF-8 encoding):
mysql> show collation like 'utf8%';
And choose using the collation charts as a reference.
One more special collation is utf8_bin
in which there are no equivalencies, it's a binary match.
The only MySQL Unicode collations I'm aware of that are not language specific are utf8_unicode_ci
, utf8_general_ci
and utf8_bin
. They are rather weird. The real purpose of a collation is to make the computer match and sort as a person from somewhere would expect. Hungarian and Turkish dictionaries have their entries ordered according to different rules. Specifying a collation allows you to sort and match according to such local rules.
For example, it seems Danes consider e and é equivalent but Icelanders don't:
mysql> select _utf8'e' collate utf8_danish_ci
-> = _utf8'é' collate utf8_danish_ci as equal;
+-------+
| equal |
+-------+
| 1 |
+-------+
mysql> select _utf8'e' collate utf8_icelandic_ci
-> = _utf8'é' collate utf8_icelandic_ci as equal;
+-------+
| equal |
+-------+
| 0 |
+-------+
Another handy trick is to fill a one column table with a bunch of characters you're interested in (it's easier from a script) and then MySQL can tell you the equivalencies:
mysql> create table t (c char(1) character set utf8);
mysql> insert into t values ('a'), ('ä'), ('á');
mysql> select group_concat(c) from t group by c collate utf8_icelandic_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a |
| á |
| ä |
+-----------------+
mysql> select group_concat(c) from t group by c collate utf8_danish_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,á |
| ä |
+-----------------+
mysql> select group_concat(c) from t group by c collate utf8_general_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,ä,á |
+-----------------+
Of course, this will work:
SELECT * FROM table WHERE name LIKE BINARY 'namé';
one thing you can do with your query string is to decode it...
< ?php
$query="उनकी"; // some Unicode characters
$query=urldecode($query);
$qry= "SELECT * FROM table WHERE books LIKE '%$query%'";
//rest of the code....
?>
it worked for me. :)
You have set collation to utf8_unicode_ci
which equates accented latin characters. Additional information can be found here.
I found out, that you get the requested result using REGEXP
SELECT * FROM table WHERE name REGEXP 'namé';
But this doesn't help if you try to group exactly by name.