I know this sounds really silly, but what character encoding should I use for something that looks like this in UTF-8?
�� Ã�¼Ã��Ã�½Ã�±Ã�¼Ã�Â
The website is in English. This is user-generated content that is stored in a database with the utf8_general_ci collation and displayed on the screen. I just want to display it properly. What do I have to do?
OK, the original text was something like this:
I αм iиvisibłє łiкє αiя---
I αм αs iмρøяŧαиŧ αs øxygєи---
I αм łiviиg iи ŧЋє wøяłd øƒ мy dяєαмz
I αм αłwαys ŧЋєяє ŧø Ћєłρ øŧЋєяz---
I αм busy buŧ иєvєя igиøяє αиy øиє
I αм ŧЋє øиє wЋø cαяєz---
I łøvє ŧø sєє øŧЋєя łαugЋiиg
I αм ŧЋє øиє wЋø bøяяøw øŧЋєяz søяяøw
I αм ŧЋє øиє wЋøz иαugЋŧy buŧ иicє
I αм łøsŧ iи мy ŧЋøugЋŧs---
I łøvє ŧø ŧαłк---
I łøvє ŧø sЋαяє---
I αм яєαdy ŧø gø αиy wЋєяє---
I łøvє ŧø ƒły buŧ døи’ŧ Ћαvє wiиgs—
I wαиŧ ŧøø ŧøucЋ ŧЋє sкy łiмiŧs---
I αм єvił buŧ иøŧ dєvił---
I иєvєя ƒøłłøw αиy ŧяєиd---
I αм ƒuиłøviиg---
suм ŧiмє łøvє ŧø bє αłøиє---
I łøvє ŧø łivє---
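Using UTF-8 is just fine, but here are a few checkpoints.
If you are using MySQL, set the database, table, and column collations to utf8_unicode_ci.
If you are using PHP, run mysql_query('SET NAMES utf8');
before your queries so that the connection itself uses UTF-8. (The old mysql_* functions were removed in PHP 7; with mysqli the equivalent is mysqli_set_charset($link, 'utf8'); where $link is your connection.)
In the HTML output, use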
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
It might be more than a problem of choosing a display character set. That string unfortunately has a lot of replacement characters (�), which indicates that it has already gone through a process where characters were lost because the incoming encoding wasn't understood. Even the fragment "ï¿½" is probably the replacement character's UTF-8 bytes (EF BF BD) viewed through a single-byte encoding.
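To make the two failure modes above concrete, here is a small sketch in Python (my choice for illustration; the site itself is presumably PHP). It assumes Latin-1 as the single-byte encoding, which is only a guess about what happened on this site, and the sample strings are made up.

```python
# Illustrative only: how UTF-8 text turns into mojibake and replacement characters.
# Latin-1 is an assumed single-byte encoding; the real culprit may differ.

original = "I \u03b1\u043c"  # "I" + Greek alpha + Cyrillic em, like the start of the poem

utf8_bytes = original.encode("utf-8")

# Failure mode 1: valid UTF-8 bytes are misread as a single-byte encoding.
# Nothing is lost yet; every byte just becomes its own (wrong) character.
misread = utf8_bytes.decode("latin-1")

# Failure mode 2: bytes that are not valid UTF-8 are decoded leniently.
# The offending byte is swapped for U+FFFD and cannot be recovered afterwards.
lossy = b"caf\xe9".decode("utf-8", errors="replace")

# U+FFFD itself, stored as UTF-8 and then viewed through Latin-1,
# shows up as the three characters "ï¿½".
replacement_as_latin1 = "\ufffd".encode("utf-8").decode("latin-1")

print(repr(misread), repr(lossy), repr(replacement_as_latin1))
```

The first case can usually be reversed; the second cannot, which is why it matters which one is actually in the database.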
To check the quality of the information in the database, can you append the output of, say,
SELECT CHARSET(colname), HEX(LEFT(colname, 20)) FROM tablename;
to the question? (colname and tablename stand in for your actual column and table names.)
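Once you have that hex output, something like the following Python sketch could help read it. The function name inspect_hex and the sample hex strings are hypothetical; it only checks whether the bytes are valid UTF-8 and whether they already contain the replacement character's bytes (EF BF BD).

```python
def inspect_hex(hex_dump: str) -> dict:
    """Summarise a hex dump such as the one MySQL's HEX() returns."""
    raw = bytes.fromhex(hex_dump)
    try:
        raw.decode("utf-8")  # strict decode: raises on any invalid sequence
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {
        "length": len(raw),
        "valid_utf8": valid,
        # Each EF BF BD triple is a U+FFFD that was stored as data,
        # i.e. a character that was already lost before it reached the table.
        "replacement_chars": raw.count(b"\xef\xbf\xbd"),
    }


# Made-up examples, not taken from the real table:
print(inspect_hex("4920CEB1D0BC"))  # clean UTF-8 for "I" + alpha + em
print(inspect_hex("EFBFBD"))        # a stored replacement character
print(inspect_hex("E9"))            # a lone Latin-1 byte, not valid UTF-8
```

If replacement_chars is non-zero, the damage happened on the way in and no display setting will bring the text back.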
Users on your site could be entering characters in an encoding other than UTF-8, such as Big5 or JIS. This is a problem: you need to either enforce that they enter UTF-8, or somehow detect the character set they've used and then convert it to UTF-8. Many locales have a customary character set. For example, if a user asks for a Japanese interface, it's likely they're using something like JIS, and you might be able to convert JIS to UTF-8 on the way in and UTF-8 back to JIS on the way out. If you can't convert, just make sure you declare UTF-8 in your page's meta tag (if your interface is HTML), and enforce that only valid UTF-8 makes it into your database.
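A rough sketch of that "validate or convert on the way in" idea, written in Python for illustration. The function name to_utf8 and the declared_charset parameter are hypothetical; where the declared charset comes from (a form field, a user preference, the locale) is up to your application.

```python
from typing import Optional


def to_utf8(raw: bytes, declared_charset: Optional[str] = None) -> bytes:
    """Return raw as UTF-8 bytes, converting from declared_charset if needed.

    Caveat: some non-UTF-8 input happens to be valid UTF-8 by accident,
    so passing the first check is not proof of the user's intent.
    """
    try:
        raw.decode("utf-8")  # strict: raises if the bytes are not valid UTF-8
        return raw
    except UnicodeDecodeError:
        pass

    if declared_charset is None:
        # Reject instead of guessing, so garbage never reaches the database.
        raise ValueError("input is not UTF-8 and no fallback charset was declared")

    # Strict decode again: a wrong declared charset raises UnicodeDecodeError
    # rather than silently producing replacement characters.
    return raw.decode(declared_charset).encode("utf-8")


# Example: Japanese text submitted as Shift_JIS is converted before storage.
japanese = "\u65e5\u672c\u8a9e"
stored = to_utf8(japanese.encode("shift_jis"), declared_charset="shift_jis")
print(stored.decode("utf-8") == japanese)
```

Rejecting undeclared non-UTF-8 input is a deliberate choice here: a failed request is easier to deal with than text that has been permanently replaced with U+FFFD.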
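You may want to use the following PHP conversion functions for UTF-8 handling:
utf8_decode (UTF-8 to ISO-8859-1 only)
utf8_encode (ISO-8859-1 to UTF-8 only)
iconv (converts between arbitrary character sets)
Note that utf8_encode and utf8_decode are deprecated as of PHP 8.2, so iconv is the safer choice for new code.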
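For comparison, here is roughly what those conversions do, sketched in Python rather than PHP. The helper names are hypothetical, and undo_double_encoding assumes the text was damaged by exactly one round of "UTF-8 bytes read as Latin-1", which may not match what happened to your data.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Roughly what utf8_encode does: treat the input as ISO-8859-1."""
    return raw.decode("latin-1").encode("utf-8")


def convert(raw: bytes, from_charset: str, to_charset: str = "utf-8") -> bytes:
    """Rough analogue of iconv: strict conversion between two charsets."""
    return raw.decode(from_charset).encode(to_charset)


def undo_double_encoding(text: str) -> str:
    """Reverse one round of UTF-8 bytes having been read as Latin-1.

    Raises UnicodeEncodeError or UnicodeDecodeError if the text does not
    fit that pattern, so it will not silently make things worse.
    """
    return text.encode("latin-1").decode("utf-8")


print(latin1_to_utf8(b"\xe9"))                       # e-acute as UTF-8 bytes
print(convert("\u65e5".encode("big5"), "big5"))      # Big5 input to UTF-8
print(undo_double_encoding("I \u00ce\u00b1\u00d0\u00bc"))
```

Applying a Latin-1 to UTF-8 conversion to text that is already UTF-8 is exactly how double-encoded output is produced, so check what the bytes really are before converting anything.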