I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
- ???? instead of Asian characters. Even for European text, I got Se?or for Señor.
- Strange gibberish (Mojibake?) such as SeÃ±or or æ–°æµªæ–°é—» for 新浪新闻.
- Black diamonds, such as Se�or.
- Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
- Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data and, if so, how?
Funny how you answer your own question :)
- Set your code IDE language to UTF8.
- Add a UTF-8 charset declaration to the header of the webpage where you collect form data.
- Check that your MySQL table definition declares a UTF-8 character set.
- If you are using PDO, make sure the connection is set to UTF-8 as well.
If you already have a large database with the above problem, you can try SIDU to export it with the correct charset and import it back as UTF8. Good luck.
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
After adding this line to my config file everything works fine!
I found this solution for mysqli, https://www.w3schools.com/PHP/func_mysqli_set_charset.asp, when I was trying to fix an insert from an HTML query.
Good luck!
Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best; however, if you are getting weird characters, it might help to change the webpage encoding to ANSI. This helped me when I was setting up PHP with MySQLi. This might help you understand more: https://superuser.com/questions/762473/ansi-to-utf-8-in-notepad
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
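To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch. The helper names utf8_len and fits_mysql_utf8 are hypothetical (they are not part of MySQL or any library); the code only counts UTF-8 bytes per character, on the assumption that MySQL's utf8 stores at most 3 bytes per character.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper: text that fails this check needs utf8mb4.
    """
    return all(utf8_len(ch) <= 3 for ch in text)


if __name__ == "__main__":
    # "\U0001F600" is an Emoji (grinning face), which needs 4 bytes.
    for sample in ["Señor", "新浪新闻", "\U0001F600"]:
        widths = [utf8_len(ch) for ch in sample]
        print(sample, widths, fits_mysql_utf8(sample))
```

Only the Emoji sample fails the check, which is the kind of text that calls for utf8mb4 rather than utf8.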
Overview of what you should do
- Start HTML forms with <form accept-charset="UTF-8">.
- Declare columns and tables CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
- Put <meta charset=UTF-8> at the beginning of the HTML.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
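Viewing the data with a tool or with SELECT cannot be trusted. Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled. So, pick a table and column that has some non-English text and do a SELECT .. HEX .. on it.
The HEX for correctly stored UTF-8 will be
- 20 for a blank space
- 4x, 5x, 6x, or 7x for English letters
- Cxyy for accented letters of most Western European languages
- Dxyy for Cyrillic, Hebrew, and Arabic
- Exyyzz for most Asian scripts
- F0yyzzww for Emoji and some of Chinese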
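The patterns above can be reproduced outside MySQL. The following Python sketch is an illustration only: utf8_hex and split_sequences are hypothetical helper names, and the code mimics what HEX() would show for a correctly stored UTF-8 column rather than querying a database. The last lines show the same text when double encoded, assuming the UTF-8 bytes were misread as Latin-1 and then encoded as UTF-8 again.

```python
from typing import List


def utf8_hex(text: str) -> str:
    """Upper-case hex of the UTF-8 bytes, similar in shape to HEX() output."""
    return text.encode("utf-8").hex().upper()


def split_sequences(hex_string: str) -> List[str]:
    """Split a hex dump into per-character UTF-8 sequences by lead byte."""
    data = bytes.fromhex(hex_string)
    out, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            size = 1  # 20, 4x..7x: ASCII
        elif 0xC2 <= lead <= 0xDF:
            size = 2  # Cxyy or Dxyy
        elif 0xE0 <= lead <= 0xEF:
            size = 3  # Exyyzz
        elif 0xF0 <= lead <= 0xF4:
            size = 4  # F0yyzzww
        else:
            raise ValueError(f"invalid UTF-8 lead byte {lead:02X} at offset {i}")
        out.append(data[i:i + size].hex().upper())
        i += size
    return out


if __name__ == "__main__":
    good = utf8_hex("Señor")
    print(good, split_sequences(good))   # 5365C3B16F72, with C3B1 for ñ

    # Double encoding (assumed Latin-1 misreading): ñ turns into C383 C2B1.
    double = utf8_hex("Señor".encode("utf-8").decode("latin-1"))
    print(double, split_sequences(double))
```

A Western European accented letter that shows up as two Cxyy groups (here C383 C2B1 instead of C3B1) is a typical sign of double encoding.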
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
Black Diamonds with question marks (Se�or for Señor); one of these cases exists:
Case 1 (original bytes were not UTF-8):
- The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
- Also check that the column is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
- The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
- Also check that the column is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
- The column is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Mojibake (SeÃ±or for Señor): (This discussion also applies to Double Encoding, which is not necessarily visible.)
- The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
- The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
- HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then either you have picked the wrong collation, or there is no collation that suits your need, or you have Double Encoding.
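Each of the symptoms described above can be reproduced in a few lines of Python, which may help when working out which step went wrong. This is a sketch of the underlying byte-level mistakes, not of MySQL's internals: Latin-1 and ASCII stand in for whatever non-UTF-8 encoding was actually involved, and the truncation case merely imitates cutting the text at the first invalid byte.

```python
text = "Señor"

# Mojibake: UTF-8 bytes misread as Latin-1.
mojibake = text.encode("utf-8").decode("latin-1")

# Question mark: the character cannot be represented in the target
# character set (ASCII is a stand-in here), so it is replaced by "?".
question = text.encode("ascii", errors="replace").decode("ascii")

# Black diamond: Latin-1 bytes read as UTF-8; the invalid byte is
# replaced by U+FFFD, which browsers draw as a black diamond.
latin1_bytes = text.encode("latin-1")
diamond = latin1_bytes.decode("utf-8", errors="replace")

# Truncation: keep only what precedes the first invalid byte.
try:
    truncated = latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    truncated = latin1_bytes[:err.start].decode("utf-8")

# Reversing the Mojibake step; this only undoes this one specific mistake.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(mojibake, question, diamond, truncated, repaired)
```

Note that only the Mojibake case is reversible in this sketch; the question-mark and truncation cases have already discarded the original bytes.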
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
I was also searching for the same issue; it took me nearly a month to find the appropriate solution. First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.
For Java:
While making the JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 to the connection URL as parameters and it will work.
For Python:
Before querying the database, try enforcing this on the cursor: *
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
*If it does not work, happy hunting for the right solution.