I am having great problems solving this one:
I have a mysql database encoding latin1_swedish_ci and a table that stores names and addresses.
I am trying to output a UTF-8 XML file, but I am having problems with the following string:
Otivägen
It is being output as OtivÃ¤gen
when I open the file in vim. Also, when it is opened in IE I get:
"An invalid character was found in text content. Error processing resource"
I have the following code:
function fixEncoding($in_str)
{
    $cur_encoding = mb_detect_encoding($in_str);
    if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8"))
        return $in_str;
    else
        return utf8_encode($in_str);
}

header("Content-type: text/plain;charset=utf-8");
$mystring = "Otivägen"; // this is actually obtained from the database
$myxml = "<myxml>
....
<node>".$mystring."</node>
....
</myxml>
";
$myxml = fixEncoding($myxml);
The actual XML output is below:
<?xml version="1.0" encoding="UTF-8" ?>
<myxml>
....
<node>OtivÃ¤gen</node>
....
</myxml>
Any ideas how I can output the file so that in vim it reads Otivägen and not OtivÃ¤gen?
EDIT:
I called mysql_client_encoding() and got latin1. I then called mysql_set_charset() and ran mysql_client_encoding() again, which returned utf8, but I still have the same output issue.
Edit 2
I have logged in on the command line and run this query:
SELECT address1 FROM address WHERE id = 1000;
Current database: ftpuser_db
+-------------+
| address1 |
+-------------+
| Otivägen 32 |
+-------------+
1 row in set (0.06 sec)
Thanks in advance!
Is your MySQL connection encoding properly set to UTF-8? Check mysql_set_charset() and mysql_client_encoding() for more details.
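A minimal sketch of setting the connection charset before reading the row. It assumes the mysqli extension rather than the older mysql_* functions used in the question, and the helper name fetchAddressUtf8 is hypothetical; the table and column names come from the question's query.

```php
<?php
/**
 * Hypothetical helper: fetch one address over a UTF-8 connection.
 * Assumes $db is an open mysqli connection to the question's database.
 */
function fetchAddressUtf8(mysqli $db, int $id): ?string
{
    // Ask MySQL to send results as UTF-8, whatever charset the table uses.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not switch the connection to utf8');
    }

    $stmt = $db->prepare('SELECT address1 FROM address WHERE id = ?');
    if ($stmt === false) {
        throw new RuntimeException('Could not prepare the query');
    }
    $stmt->bind_param('i', $id);
    $stmt->execute();
    $stmt->bind_result($address);
    $found = $stmt->fetch();
    $stmt->close();

    // fetch() returns true when a row was read into $address.
    return $found ? $address : null;
}
```

With the connection charset set this way, the returned string should already be UTF-8, so it should not be passed through utf8_encode() again.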
Oh boy. UTF8 issues can be a real pain and they get almost impossible to solve when something is doing re-encodings for you.
You really need to start at one end and make sure every process is UTF8. That will remove things in the process from interpreting the data wrong and 'converting' it for you. But significantly, it will also let you much more easily spot when something has already mis-encoded text for you (yes, I've had that problem).
And if you have UTF8 data in tables that aren't set to UTF8 and might be mis-encoded, you need to do the tables last, after the data has been re-encoded. Otherwise you will damage your data irretrievably. I've had that problem, too.
First steps: in vim, run
:set encoding=utf-8
This will mean that your files will be edited in UTF8.
Now we check MySQL.
In the MySQL CLI, run
show variables like 'character_set%';
The output lists each character_set_* variable with its current value. What you're aiming for is to change all the latin1 values (or whatever you're seeing) to utf8.
set names utf8;
will change most of them, and you might need to do that with every new connection to your database. This was the solution I had to adopt in a previous application. The other settings to change are in the my.cnf file, for which I need to direct you to the documentation. It is unlikely you will need to set them all.
I see you're already setting the output headers, so that's good.
Now you can look at the data from the database and see why it's "wrong".
Before producing the output, run the query
SET NAMES utf8
After the output, you can switch back by running
SET NAMES latin1
Look here, I've got the same problem
latin1_swedish_ci is a collation, not a charset. Since collations are supposed to match their charset, it suggests that the table is using latin1, but it's not a guarantee.
Strictly speaking, the charset of the tables is irrelevant here, since MySQL can convert input and output. That's what the connection charset (mysql_set_charset) is for. However, for that to work properly, the data needs to be encoded properly in the database. I would begin by checking that the strings are correct in the database. The simplest thing is to log in on the command line and select a row which has non-ASCII characters in it. Does it look OK?
Watch out. The encoding of the data in $mystring will now depend on the encoding of the PHP file. That may or may not be the same as the data in the database.
It seems you are "double encoding" Otivägen. You get this behaviour if Otivägen is already UTF-8 and you run utf8_encode() on it again.
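The double encoding can be reproduced in a few lines. This is a sketch, not the asker's code: it uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode(), and it writes the bytes with escapes so the result does not depend on the encoding of the source file.

```php
<?php
// "Otivägen" as UTF-8 bytes: ä is the two-byte sequence C3 A4.
$utf8 = "Otiv\xC3\xA4gen";

// Treat those bytes as ISO-8859-1 and convert to UTF-8 a second time.
// Each of the two bytes of ä becomes its own two-byte character.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // 4f746976c3a467656e
echo bin2hex($double), "\n"; // 4f746976c383c2a467656e
echo $double, "\n";          // shows OtivÃ¤gen on a UTF-8 terminal
```

The extra bytes C3 83 C2 A4 in the second dump are the signature of a string that was encoded twice.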
I'm not sure where the actual "double encoding" occurs, but it may be due to settings in your editor. My theory: let's say you are running Aptana Studio and your character set is set to ISO-8859-1. (In Aptana, you can check this by right-clicking on a file and choosing "Properties". To set the default character encoding for all projects, choose Preferences from the Aptana main menu -> General -> Workspace.)
If that's the case, the PHP source file where you have $myxml and its string <myxml><node>... is detected to be ISO-8859-1, but $mystring received from the database is UTF-8. Your fixEncoding function would then run the else clause, since $myxml as a whole is seen as ISO-8859-1 and not UTF-8. This double encodes the results from the database, and may be the cause of your problem.
Check the encoding of your source file in your editor, and verify that it is set to UTF-8. Alternatively, experiment with applying or removing fixEncoding/utf8_encode/utf8_decode on $myxml. Observe the results and see what needs to be done to get the value Otivägen right.
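One way to avoid the mixed-encoding problem described above is to convert only the value that comes from the database, not the assembled XML string. The sketch below assumes that anything which is not valid UTF-8 is ISO-8859-1; the helper name toUtf8 and the sample variables are hypothetical.

```php
<?php
/**
 * Return $value as UTF-8.
 * Assumption: input that is not valid UTF-8 is ISO-8859-1 (latin1).
 */
function toUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already UTF-8, leave it alone
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// The same street name as latin1 bytes and as UTF-8 bytes.
$fromLatin1 = "Otiv\xE4gen";
$fromUtf8   = "Otiv\xC3\xA4gen";

// Convert the value first, then escape it for XML and build the node.
$nodeA = '<node>' . htmlspecialchars(toUtf8($fromLatin1), ENT_XML1, 'UTF-8') . '</node>';
$nodeB = '<node>' . htmlspecialchars(toUtf8($fromUtf8), ENT_XML1, 'UTF-8') . '</node>';

echo bin2hex($nodeA) === bin2hex($nodeB) ? "same bytes\n" : "different bytes\n";
```

This is still a guess about the input: a latin1 string whose bytes happen to form valid UTF-8 would be left unconverted, so knowing the connection charset is more reliable than detecting it.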
I think you did everything correctly, except that your terminal is in Latin-1.
The UTF-8 sequence for ä is C3 A4, which is Ã¤ if displayed as Latin-1.
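To tell a Latin-1 terminal apart from a genuinely broken file, it can help to look at the bytes instead of the rendered text. The following is a rough sketch with a hypothetical helper, describeAUmlaut, that only knows about the letter ä; the E4 check is a heuristic, since E4 can also start an unrelated three-byte UTF-8 sequence.

```php
<?php
/**
 * Hypothetical helper: report how the letter "ä" is stored in $bytes.
 */
function describeAUmlaut(string $bytes): string
{
    if (strpos($bytes, "\xC3\x83\xC2\xA4") !== false) {
        return 'double-encoded UTF-8'; // UTF-8 that was encoded a second time
    }
    if (strpos($bytes, "\xC3\xA4") !== false) {
        return 'UTF-8';                // correct for a UTF-8 XML file
    }
    if (strpos($bytes, "\xE4") !== false) {
        return 'ISO-8859-1';           // single latin1 byte
    }
    return 'no ä found';
}

echo describeAUmlaut("<node>Otiv\xC3\xA4gen</node>"), "\n";         // UTF-8
echo describeAUmlaut("<node>Otiv\xC3\x83\xC2\xA4gen</node>"), "\n"; // double-encoded UTF-8
echo describeAUmlaut("<node>Otiv\xE4gen</node>"), "\n";             // ISO-8859-1
```

Passing the contents of the generated file (for example via file_get_contents()) to the helper would show whether the file itself is fine and only the terminal is misreading it.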