I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL.
In my database I have a few instances of bad encodings that print like: î
- The database collation is utf8_general_ci
- PHP is using a proper UTF8 header
- Notepad++ is set to use UTF8 without BOM
- database management is handled in phpMyAdmin
- not all cases of accented characters are broken
What I need is some sort of function that will help me map the instances of î, ÃÂ, ü and others like it to their proper accented UTF8 characters.
One way is to convert the data to binary and then to the correct encoding.
I've had to try to 'fix' a number of broken UTF-8 situations in the past, and unfortunately it's never easy and often close to impossible.
Unless you can determine exactly how the data was broken, and it was always broken in that exact same way, it's going to be hard to 'undo' the damage.
If you want to try to undo the damage, your best bet would be to start writing some sample code, where you attempt numerous variations on calls to mb_convert_encoding() to see if you can find a combination of 'from' and 'to' that fixes your data. In the end, it's often best to not even bother worrying about fixing the old data because of the pain levels involved, but instead to just fix things going forward.
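As a rough sketch of that trial-and-error approach: the snippet below tries a few candidate legacy encodings and keeps only the conversions that look like a real repair. The helper name `try_repairs` and the candidate list are my own assumptions for illustration, and it assumes the mbstring extension is loaded. It only handles the case where UTF-8 bytes were misread as a single-byte encoding and then encoded again.

```php
<?php
// Hypothetical helper: for each candidate legacy encoding, try to reverse one
// round of mis-decoding and keep only results that are plausible repairs.
function try_repairs($broken, array $candidates)
{
    $results = [];
    foreach ($candidates as $encoding) {
        // Map each character back to the single byte it has in $encoding.
        $bytes = mb_convert_encoding($broken, $encoding, 'UTF-8');
        // Keep the result only if it changed something, is valid UTF-8, and
        // converts back to the input (so no character was replaced by "?").
        if ($bytes !== $broken
            && mb_check_encoding($bytes, 'UTF-8')
            && mb_convert_encoding($bytes, 'UTF-8', $encoding) === $broken) {
            $results[$encoding] = $bytes;
        }
    }
    return $results;
}

// Sample input: the UTF-8 bytes of "î" after being encoded a second time.
$sample = "\xC3\x83\xC2\xAE";
$repairs = try_repairs($sample, ['ISO-8859-1', 'Windows-1252', 'ISO-8859-15']);
foreach ($repairs as $encoding => $fixed) {
    echo $encoding, ' => ', bin2hex($fixed), PHP_EOL;
}
```

If several encodings give the same result, as they do for this sample, any of them describes the damage well enough. If none does, the data was probably broken in some other way.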
However, before doing this, you need to make sure that you fix everything that is causing this issue in the first place. You've already mentioned that your DB table collation and editors are set properly. But there are more places where you need to check to make sure that everything is properly UTF-8:
If you slip up on any one step in the whole process, the encoding can be mangled and problems arise. Once you get in the 'groove' of doing UTF-8, though, this all becomes second nature. And of course, PHP6 is supposed to be fully Unicode compliant from the get-go, which will make lots of this easier (hopefully).
This script had a nice approach. Converting it to the language of your choice should not be too difficult:
http://plasmasturm.org/log/416/
I had a problem with an XML file that had a broken encoding: it said it was UTF-8, but it had characters that were not UTF-8.
After several rounds of trial and error with mb_convert_encoding(), I managed to fix it.
I know this isn't very elegant, but after it was mentioned that the strings may be double encoded, I made this function:
This seems to work perfectly to remove the double encoding I am experiencing. I am probably missing some of the characters that could be an issue to others. However, for my needs it is working perfectly.
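For anyone who wants a starting point, here is a minimal sketch of a double-encoding fix. It is not the function described above: the name `fix_double_encoding`, the `$maxRounds` parameter, and the assumption that each bad round went through Windows-1252 are all mine. It needs the mbstring extension.

```php
<?php
// Hypothetical helper: undo repeated UTF-8 encoding, assuming each extra
// round happened because UTF-8 bytes were misread as Windows-1252.
function fix_double_encoding($text, $maxRounds = 3)
{
    for ($round = 0; $round < $maxRounds; $round++) {
        // Turn each character back into the byte it stands for in Windows-1252.
        $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
        // Accept the step only if it changed the text, produced valid UTF-8,
        // and is lossless (converting it back gives the same input).
        $isRepair = $candidate !== $text
            && mb_check_encoding($candidate, 'UTF-8')
            && mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $text;
        if (!$isRepair) {
            break; // nothing left to undo, or the text was fine already
        }
        $text = $candidate;
    }
    return $text;
}

$fixed = fix_double_encoding("\xC3\x83\xC2\xAE"); // double-encoded "î"
echo bin2hex($fixed), PHP_EOL;
```

Correctly encoded accented text normally fails the validity check and comes back unchanged, but a string that legitimately contains a sequence like "Ã" followed by "®" would be "repaired" by mistake, so test it on a copy of the data first.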
It looks like your UTF-8 is being interpreted as ISO-8859-1 or Windows-1252 at some point.
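To illustrate how that kind of damage arises, this small sketch (assuming the mbstring extension; the variable names are just for the example) takes a correctly encoded "î", treats its two bytes as ISO-8859-1, and encodes them as UTF-8 again:

```php
<?php
$original = "\xC3\xAE"; // "î" as correct UTF-8 (two bytes)

// Misread those two bytes as ISO-8859-1 characters and re-encode as UTF-8.
$mangled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Show the bytes before and after the bad conversion.
echo bin2hex($original), ' -> ', bin2hex($mangled), PHP_EOL; // c3ae -> c383c2ae
```

The four resulting bytes are what displays as "î" on a UTF-8 page, which matches the symptom in the question.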
When you say "In my database I have a few instances of bad encodings", how did you check this? Through your app, phpMyAdmin, or the command-line client? Are all UTF-8 characters showing up like this, or only some? Is it possible that the encoding was declared wrongly, so data that was already UTF-8 got converted from ISO-8859-1 to UTF-8 a second time?