I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.
At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.
Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, then I can just fix them manually.
Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?
Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters?
e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);
Is this enough?
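One caveat I can see with Option 2: in a latin1 column every character is a single byte, so LENGTH() and CHAR_LENGTH() should always agree and the query above may never match anything. A byte-level variant might be needed instead; here is a rough sketch (utf8mb4 is my assumption for the UTF-8 flavour, table and column as above):

SELECT name
FROM clients
-- reinterpret the raw latin1 bytes as UTF-8 and flag rows where that
-- interpretation has fewer characters than bytes, i.e. multi-byte sequences
WHERE CHAR_LENGTH(CONVERT(CONVERT(name USING binary) USING utf8mb4)) < LENGTH(name);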
At the moment I have switched my MySQL client encoding to UTF-8.
Since your question is not completely clear, let's assume some scenarios.
There is a script on GitHub to help with this sort of thing.
Character encoding, like time zones, is a constant source of problems.
What you can do is look for any "high-ASCII" characters, as these are either Latin1 accented characters or symbols, or the first byte of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.
To figure out which encoding is correct, you just SELECT two differently converted versions of the same value and compare them visually. This is made unusually complicated because the MySQL regexp engine seems to ignore escapes like \x80 in a pattern, which makes it necessary to use the UNHEX() method instead. The result is each suspect row rendered both as Latin1 and as UTF-8, so you can eyeball which interpretation looks right.
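A minimal sketch of that kind of query, assuming the clients table and name column from the question, utf8mb4 as the UTF-8 flavour, and the byte-oriented regexp behaviour of MySQL before 8.0 (the ICU engine in 8.0+ handles binary strings differently):

SELECT
  CONVERT(CONVERT(name USING binary) USING latin1)  AS as_latin1,
  CONVERT(CONVERT(name USING binary) USING utf8mb4) AS as_utf8
FROM clients
-- UNHEX() supplies the literal bytes 0x80 and 0xFF, since escapes such as
-- \x80 are not honoured inside the pattern itself
WHERE CONVERT(name USING binary)
      RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');

If the as_utf8 column shows sensible accented text (é rather than é), the row was already storing UTF-8 bytes; if as_latin1 is the readable one, the data really is Latin1.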
I would create a dump of the database and grep for all valid UTF-8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF-8; you can basically just reverse the logic.
Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected; if you are lucky, a handful of obvious substitutions will fix the absolute majority (replace the UTF-8-encoded ö with a Latin-1 ö, etc.).
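If you would rather do that classification inside the database than on a dump, a rough SQL equivalent of the same logic might look like this (clients/name from the question, utf8mb4 assumed, and relying on MySQL substituting a replacement character rather than raising an error when a conversion hits invalid bytes):

SELECT name,
       CASE
         -- bytes unchanged by an ASCII round trip: pure 7-bit ASCII, safe as-is
         WHEN CONVERT(CONVERT(CONVERT(name USING binary) USING ascii) USING binary)
              = CONVERT(name USING binary) THEN 'ascii (safe)'
         -- bytes unchanged by a UTF-8 round trip: already valid UTF-8, inspect
         WHEN CONVERT(CONVERT(CONVERT(name USING binary) USING utf8mb4) USING binary)
              = CONVERT(name USING binary) THEN 'valid utf-8 (inspect)'
         -- anything else cannot be UTF-8, so treat it as genuine Latin1
         ELSE 'latin1 (convert)'
       END AS verdict
FROM clients;

Rows in the middle bucket still need a visual check, since a Latin1 string can occasionally happen to form a valid UTF-8 sequence.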