Detecting utf8 broken characters in MySQL

Posted 2019-07-19 14:50

乱世女痞

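I have a database with a bunch of broken utf8 characters scattered across several tables. The list of characters isn't very extensive AFAIK (áéíúóÁÉÍÓÚÑñ)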

Fixing a given table is very straightforward

update orderItem set itemName=replace(itemName,'Ã¡','á');

But I can't find a way to detect the broken characters. If I do

SELECT * FROM TABLE WHERE field LIKE "%Ã%";

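I get nearly every row, because of the collation (Ã = A). All the broken characters so far start with "Ã". The database is in Spanish, so that particular character is not used.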

So far I've got this list of broken characters

Ã¡ = á
Ã© = é
Ã- = í
Ã³ = ó
Ã± = ñ
Ã + byte 0x81 = Á (the second byte is an invisible control character, so only "Ã" is visible)

Any ideas on how to make this SELECT work as intended? (a binary search or something like that)
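The list of broken characters above can be reproduced mechanically. This is a minimal sketch in Python (chosen only so it runs without a database), assuming the damage came from UTF-8 bytes being read as Windows-1252, which is the charset MySQL calls latin1. `broken_form` is a hypothetical helper name, not something from the thread.

```python
# Sketch: how a correctly typed character ends up "broken".
# Assumption: UTF-8 bytes were misread as Windows-1252 (MySQL's latin1).

def broken_form(ch: str) -> str:
    """Return what a character looks like after being double-encoded."""
    # errors="replace" is needed because a few bytes (0x81, for example)
    # have no Windows-1252 character at all.
    return ch.encode("utf-8").decode("cp1252", errors="replace")

for ch in "áéíóñÁ":
    # repr() makes invisible characters such as the soft hyphen visible.
    print(ch, "->", repr(broken_form(ch)))
```

Every one of these characters encodes to the lead byte 0xC3 in UTF-8, which is why all the broken forms start with "Ã". For í the second byte is 0xAD (a soft hyphen), and for Á it is 0x81, which has no Windows-1252 character, so both tend to show up as a lone "Ã".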

Answer 1:

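How about a different approach, namely converting the column back and forth to get the correct character set? You can convert it to binary, then to utf-8, and then to iso-8859-1 or whatever else you're using. See the manual for the details.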

Answer 2:

I fixed it with

UPDATE wp_zcs9ck_posts_copy SET post_title = 
    CONVERT(BINARY CONVERT(post_title USING latin1) USING utf8);

Full solution: http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/
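To see what that nested CONVERT does at the byte level, here is a rough Python equivalent. It is only a sketch: `fix_double_encoded` is a hypothetical name, and `cp1252` stands in for MySQL's latin1 (MySQL additionally accepts a handful of bytes that Python's codec rejects).

```python
# Sketch of the round trip behind
#   CONVERT(BINARY CONVERT(col USING latin1) USING utf8)

def fix_double_encoded(text: str) -> str:
    # Step 1: turn the visible characters back into the original bytes.
    raw = text.encode("cp1252")
    # Step 2: read those bytes as what they always were: UTF-8.
    return raw.decode("utf-8")

print(fix_double_encoded("Ã¡rbol"))
print(fix_double_encoded("seÃ±or"))
```

Note that this raises an exception on text that was never double-encoded, which is the Python counterpart of the data loss other answers warn about.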

Answer 3:

UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã¡','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã¤','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã³','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ãº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã±','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'â€“','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€™','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¦','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€“','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€œ','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€˜','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¢','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');

Answer 4:

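No text replacement is a general solution, because you can always forget some characters. A more appropriate fix for doubly converted text is:

Convert back to latin1
Convert to binary
Convert to utf8

Like this: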

alter table descriptions modify name VARCHAR(2000) character set latin1;
alter table descriptions modify name blob;
alter table descriptions modify name VARCHAR(2000) character set utf8;

Answer 5:

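Thanks for your answers!

I fixed my table and would like to share the complete list of changes. Note that it also includes fixes for HTML-encoded characters, besides the Latin ones; it was a real mess: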

update `table` set `field` = replace(`field` ,'Ã‰','É');
update `table` set `field` = replace(`field` ,'â€œ','"');
update `table` set `field` = replace(`field` ,'â€','"');
update `table` set `field` = replace(`field` ,'Ã‡','Ç');
update `table` set `field` = replace(`field` ,'Ãƒ','Ã');
-- Edit by slash4
update `table` set `field` = replace(`field` ,'Ã ','À');
update `table` set `field` = replace(`field` ,'Ãº','ú');
update `table` set `field` = replace(`field` ,'â€¢','-');
update `table` set `field` = replace(`field` ,'Ã˜','Ø');
update `table` set `field` = replace(`field` ,'Ãµ','õ');
-- The next one appears to be missing a character: í is C3 AD in UTF-8, and 0xAD is a soft hyphen, which is usually invisible.
update `table` set `field` = replace(`field` ,'Ã','í');
update `table` set `field` = replace(`field` ,'Ã¢','â');
update `table` set `field` = replace(`field` ,'Ã£','ã');
update `table` set `field` = replace(`field` ,'Ãª','ê');
update `table` set `field` = replace(`field` ,'Ã¡','á');
update `table` set `field` = replace(`field` ,'Ã©','é');
update `table` set `field` = replace(`field` ,'Ã³','ó');
update `table` set `field` = replace(`field` ,'â€“','–');
update `table` set `field` = replace(`field` ,'Ã§','ç');
update `table` set `field` = replace(`field` ,'Âª','ª');
update `table` set `field` = replace(`field` ,'Âº','º');
update `table` set `field` = replace(`field` ,'Ã ','à');
update `table` set `field` = replace(`field` ,'&ccedil;','ç');
update `table` set `field` = replace(`field` ,'&atilde;','ã');
update `table` set `field` = replace(`field` ,'&aacute;','á');
update `table` set `field` = replace(`field` ,'&acirc;','â');
update `table` set `field` = replace(`field` ,'&eacute;','é');
update `table` set `field` = replace(`field` ,'&iacute;','í');
update `table` set `field` = replace(`field` ,'&otilde;','õ');
update `table` set `field` = replace(`field` ,'&uacute;','ú');
update `table` set `field` = replace(`field` ,'&ccedil;','ç');
update `table` set `field` = replace(`field` ,'&Aacute;','Á');
update `table` set `field` = replace(`field` ,'&Acirc;','Â');
update `table` set `field` = replace(`field` ,'&Eacute;','É');
update `table` set `field` = replace(`field` ,'&Iacute;','Í');
update `table` set `field` = replace(`field` ,'&Otilde;','Õ');
update `table` set `field` = replace(`field` ,'&Uacute;','Ú');
update `table` set `field` = replace(`field` ,'&Ccedil;','Ç');
update `table` set `field` = replace(`field` ,'&Atilde;','Ã');
update `table` set `field` = replace(`field` ,'&Agrave;','À');
update `table` set `field` = replace(`field` ,'&Ecirc;','Ê');
update `table` set `field` = replace(`field` ,'&Oacute;','Ó');
update `table` set `field` = replace(`field` ,'&Ocirc;','Ô');
update `table` set `field` = replace(`field` ,'&Uuml;','Ü');
update `table` set `field` = replace(`field` ,'&atilde;','ã');
update `table` set `field` = replace(`field` ,'&agrave;','à');
update `table` set `field` = replace(`field` ,'&ecirc;','ê');
update `table` set `field` = replace(`field` ,'&oacute;','ó');
update `table` set `field` = replace(`field` ,'&ocirc;','ô');
update `table` set `field` = replace(`field` ,'&uuml;','ü');
update `table` set `field` = replace(`field` ,'&amp;','&');
update `table` set `field` = replace(`field` ,'&gt;','>');
update `table` set `field` = replace(`field` ,'&lt;','<');
update `table` set `field` = replace(`field` ,'&circ;','ˆ');
update `table` set `field` = replace(`field` ,'&tilde;','˜');
update `table` set `field` = replace(`field` ,'&uml;','¨');
update `table` set `field` = replace(`field` ,'&acute;','´');
update `table` set `field` = replace(`field` ,'&cedil;','¸');
update `table` set `field` = replace(`field` ,'&quot;','"');
update `table` set `field` = replace(`field` ,'&ldquo;','“');
update `table` set `field` = replace(`field` ,'&rdquo;','”');
update `table` set `field` = replace(`field` ,'&lsquo;','‘');
update `table` set `field` = replace(`field` ,'&rsquo;','’');
update `table` set `field` = replace(`field` ,'&lsaquo;','‹');
update `table` set `field` = replace(`field` ,'&rsaquo;','›');
update `table` set `field` = replace(`field` ,'&laquo;','«');
update `table` set `field` = replace(`field` ,'&raquo;','»');
update `table` set `field` = replace(`field` ,'&ordm;','º');
update `table` set `field` = replace(`field` ,'&ordf;','ª');
update `table` set `field` = replace(`field` ,'&ndash;','–');
update `table` set `field` = replace(`field` ,'&mdash;','—');
update `table` set `field` = replace(`field` ,'&macr;','¯');
update `table` set `field` = replace(`field` ,'&hellip;','…');
update `table` set `field` = replace(`field` ,'&brvbar;','¦');
update `table` set `field` = replace(`field` ,'&bull;','•');
update `table` set `field` = replace(`field` ,'&para;','¶');
update `table` set `field` = replace(`field` ,'&sect;','§');
update `table` set `field` = replace(`field` ,'&sup1;','¹');
update `table` set `field` = replace(`field` ,'&sup2;','²');
update `table` set `field` = replace(`field` ,'&sup3;','³');
update `table` set `field` = replace(`field` ,'&frac12;','½');
update `table` set `field` = replace(`field` ,'&frac14;','¼');
update `table` set `field` = replace(`field` ,'&frac34;','¾');
update `table` set `field` = replace(`field` ,'&#8539;','⅛');
update `table` set `field` = replace(`field` ,'&#8540;','⅜');
update `table` set `field` = replace(`field` ,'&#8541;','⅝');
update `table` set `field` = replace(`field` ,'&#8542;','⅞');
update `table` set `field` = replace(`field` ,'&gt;','>');
update `table` set `field` = replace(`field` ,'&lt;','<');
update `table` set `field` = replace(`field` ,'&plusmn;','±');
update `table` set `field` = replace(`field` ,'&minus;','−');
update `table` set `field` = replace(`field` ,'&times;','×');
update `table` set `field` = replace(`field` ,'&divide;','÷');
update `table` set `field` = replace(`field` ,'&lowast;','∗');
update `table` set `field` = replace(`field` ,'&frasl;','⁄');
update `table` set `field` = replace(`field` ,'&permil;','‰');
update `table` set `field` = replace(`field` ,'&int;','∫');
update `table` set `field` = replace(`field` ,'&sum;','∑');
update `table` set `field` = replace(`field` ,'&prod;','∏');
update `table` set `field` = replace(`field` ,'&radic;','√');
update `table` set `field` = replace(`field` ,'&infin;','∞');
update `table` set `field` = replace(`field` ,'&asymp;','≈');
update `table` set `field` = replace(`field` ,'&cong;','≅');
update `table` set `field` = replace(`field` ,'&prop;','∝');
update `table` set `field` = replace(`field` ,'&equiv;','≡');
update `table` set `field` = replace(`field` ,'&ne;','≠');
update `table` set `field` = replace(`field` ,'&le;','≤');
update `table` set `field` = replace(`field` ,'&ge;','≥');
update `table` set `field` = replace(`field` ,'&there4;','∴');
update `table` set `field` = replace(`field` ,'&sdot;','⋅');
update `table` set `field` = replace(`field` ,'&middot;','·');
update `table` set `field` = replace(`field` ,'&part;','∂');
update `table` set `field` = replace(`field` ,'&image;','ℑ');
update `table` set `field` = replace(`field` ,'&real;','ℜ');
update `table` set `field` = replace(`field` ,'&prime;','′');
update `table` set `field` = replace(`field` ,'&Prime;','″');
update `table` set `field` = replace(`field` ,'&deg;','°');
update `table` set `field` = replace(`field` ,'&ang;','∠');
update `table` set `field` = replace(`field` ,'&perp;','⊥');
update `table` set `field` = replace(`field` ,'&nabla;','∇');
update `table` set `field` = replace(`field` ,'&oplus;','⊕');
update `table` set `field` = replace(`field` ,'&otimes;','⊗');
update `table` set `field` = replace(`field` ,'&alefsym;','ℵ');
update `table` set `field` = replace(`field` ,'&oslash;','ø');
update `table` set `field` = replace(`field` ,'&Oslash;','Ø');
update `table` set `field` = replace(`field` ,'&isin;','∈');
update `table` set `field` = replace(`field` ,'&notin;','∉');
update `table` set `field` = replace(`field` ,'&cap;','∩');
update `table` set `field` = replace(`field` ,'&cup;','∪');
update `table` set `field` = replace(`field` ,'&sub;','⊂');
update `table` set `field` = replace(`field` ,'&sup;','⊃');
update `table` set `field` = replace(`field` ,'&sube;','⊆');
update `table` set `field` = replace(`field` ,'&supe;','⊇');
update `table` set `field` = replace(`field` ,'&exist;','∃');
update `table` set `field` = replace(`field` ,'&forall;','∀');
update `table` set `field` = replace(`field` ,'&empty;','∅');
update `table` set `field` = replace(`field` ,'&not;','¬');
update `table` set `field` = replace(`field` ,'&and;','∧');
update `table` set `field` = replace(`field` ,'&or;','∨');
update `table` set `field` = replace(`field` ,'&crarr;','↵');

Answer 6:

The SELECT statement should be as follows:

SELECT * FROM TABLE WHERE LENGTH(name) != CHAR_LENGTH(name);

This returns all rows that contain multi-byte characters.

name is assumed to be the field where the strange characters are found.
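The comparison works because LENGTH() counts bytes while CHAR_LENGTH() counts characters. A small Python sketch of the same test, assuming a utf8 column (`has_multibyte` is a hypothetical name and the sample values are made up):

```python
# Sketch: LENGTH(name) != CHAR_LENGTH(name), outside the database.

def has_multibyte(value: str) -> bool:
    byte_length = len(value.encode("utf-8"))  # what LENGTH() reports
    char_length = len(value)                  # what CHAR_LENGTH() reports
    return byte_length != char_length

for name in ["Jose", "José", "JosÃ©"]:
    print(name, has_multibyte(name))
```

The correctly encoded "José" is flagged as well as the broken "JosÃ©", so the query narrows the search to rows with non-ASCII content rather than to broken rows only.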

Answer 7:

This was a lifesaver for me

UPDATE ohp_posts SET post_content = CONVERT(CAST(CONVERT(post_content USING latin1) AS BINARY) USING utf8)

I found it here: http://stanis.net/2014/04/replacing-latin-1-with-utf-8-characters-in-mysql/

Answer 8:

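I had the same problem but didn't like the REPLACE() solutions, because there is always a chance of missing some characters. I was working on a column with mixed data (some already utf8_encode()d and some not) of about 4 million rows, with about 250,000 records holding wrongly encoded data (with characters such as ‰), covering about 15 international languages: mostly European languages, but also Russian, Japanese and Chinese.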

I started by duplicating the column, because I didn't want to lose any data:

ALTER TABLE images ADD COLUMN reptitle TEXT;

Then I copied over all the data that has multi-byte characters (thanks to Adam's tip)

UPDATE images SET reptitle = title WHERE LENGTH(title) != CHAR_LENGTH(title)

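Since reptitle was created with the table's default character set it is already utf8, but it contains broken data, because the images table was originally latin1. The reptitle column now holds some data that is correctly encoded and some that is broken (all values with multi-byte characters, some of which had already been correctly utf8_encode()d). So then, with David's tip...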

ALTER TABLE images MODIFY reptitle TEXT character set latin1;
ALTER TABLE images MODIFY reptitle BLOB;
ALTER TABLE images MODIFY reptitle TEXT character set utf8;

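Since TEXT and BLOB are (I think) the same, the intermediate step may not be necessary. This had the effect of correcting all the wrongly encoded data ("Ã©tudiantes" became "étudiantes", etc.), but data that had previously been correct was truncated at the first multi-byte character. I don't know why it truncates, but it happened in a disposable column, so I didn't care. The truncated data gives the same value for CHAR_LENGTH and LENGTH, since no multi-byte characters remain, so it is easy to query for the rest...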

UPDATE images SET title = reptitle WHERE LENGTH(reptitle)!=CHAR_LENGTH(reptitle)

Then, of course, I just dropped the spare column

ALTER TABLE images DROP COLUMN reptitle

Also, make sure (I use PHP and this tripped me up a few times, so I thought I'd mention it here) that all your script files are UTF8 (no BOM), and use:

mysql_set_charset('utf8', $connection);

And voilà... perfectly repaired data, in all languages :)

Answer 9:

In addition to the answers by Raul Avila Solano and acseven, if you want to fix all the broken characters in a single query:

update `table` set field = replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(field,'&uuml;','ü'),'&ocirc;','ô'),'&oacute;','ó'),'&ecirc;','ê'),'&agrave;','à'),'&atilde;','ã'),'&Uuml;','Ü'),'&Ocirc;','Ô'),'&Oacute;','Ó'),'&Ecirc;','Ê'),'&Agrave;','À'),'&Atilde;','Ã'),'&Ccedil;','Ç'),'&Uacute;','Ú'),'&Otilde;','Õ'),'&Iacute;','Í'),'&Iacute;','Í'),'&Eacute;','É'),'&Acirc;','Â'),'&Aacute;','Á'),'&ccedil;','ç'),'&uacute;','ú'),'&otilde;','õ'),'&iacute;','í'),'&eacute;','é'),'&acirc;','â'),'&aacute;','á'),'&atilde;','ã'),'&ccedil;','ç'),'Ã ','à'),'Ã ','à'),'Âº','º'),'Âª','ª'),'Ã§','ç'),'â€“','–'),'Ã³','ó'),'Ã©','é'),'Ã¡','á'),'Ãª','ê'),'Ã£','ã'),'Ã¢','â'),'Ã','í'),'Ãµ','õ'),'Ã˜','Ø'),'â€¢','-'),'Ãº','ú'),'Ã ','À'),'Ãƒ','Ã'),'Ã‡','Ç'),'â€','"'),'â€œ','"'),'Ã‰','É');

Answer 10:

This also solved my problem with some Italian characters

UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã¡','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã¤','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã³','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ãº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã±','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'â€“','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€™','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¦','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€“','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€œ','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€˜','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¢','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€¡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í ','à');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¨','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íˆ','È');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'â‚¬','€');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'eÌ€','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í²','ò');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¹','ù');

Answer 11:

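You may have rows with correctly encoded utf8 alongside rows with wrongly encoded characters. In that case, "CONVERT(BINARY CONVERT(post_title USING latin1) USING utf8)" will cut off some fields.

In the end I did it this way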

update `table` set `name` = replace(`name` ,CONVERT(BINARY "ä" USING latin1),'ä');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ö" USING latin1),'ö');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ü" USING latin1),'ü');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ä" USING latin1),'Ä');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ö" USING latin1),'Ö');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ü" USING latin1),'Ü');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ß" USING latin1),'ß');

Answer 12:

Based on the data in this post, https://www.i18nqa.com/debug/utf8-debug.html, I suggest this is a good query for identifying dodgy entries and what the correct values might be:

SELECT my_field,CONVERT(BINARY CONVERT(my_field USING latin1) USING utf8mb4) AS new_field_value FROM my_table WHERE my_field REGEXP '[âÆËÅÂÃ]';

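Be very careful, though: we had bad encoding on a file name but correct encoding on the path, and in that situation some of the solutions above would cause a world of pain. If some of your data is already correctly encoded as utf-8, you may find that you lose a large chunk of it.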
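One way to see why the character class in that query contains those particular letters: each one is how Windows-1252 displays a UTF-8 lead byte. The sketch below brute-forces which code points sit behind each lead byte. `covered_range` is a hypothetical helper, and cp1252 is assumed to be the charset the bytes were misread as.

```python
# Sketch: which original characters hide behind each "suspect" letter?

SUSPECTS = "âÆËÅÂÃ"

def covered_range(suspect: str) -> tuple[int, int]:
    """Lowest and highest code point whose UTF-8 form starts with this byte."""
    lead = suspect.encode("cp1252")[0]
    hits = [
        cp
        for cp in range(0x80, 0x10000)
        if not 0xD800 <= cp <= 0xDFFF          # surrogates cannot be encoded
        and chr(cp).encode("utf-8")[0] == lead
    ]
    return hits[0], hits[-1]

for suspect in SUSPECTS:
    low, high = covered_range(suspect)
    print(f"{suspect}: U+{low:04X}..U+{high:04X}")
```

"Ã" covers U+00C0 to U+00FF (the accented Latin letters) and "â" covers U+2000 to U+2FFF (curly quotes, dashes, the euro sign), which matches the replacement lists in the other answers. A legitimately stored "Ã" or "â" matches the pattern too, so the query is a way to find candidates rather than proof of corruption.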

Answer 13:

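Since TEXT and BLOB are the same, the intermediate step may not be necessary.

This had the effect of correcting all the wrongly encoded data, but data that had previously been correct was truncated at the first multi-byte character.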
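A likely explanation for that truncation, sketched in Python under the assumption that MySQL stops at the first byte that is not valid UTF-8 (the exact behaviour depends on server version and SQL mode). `latin1_roundtrip_prefix` is a hypothetical name and the sample strings are made up.

```python
# Sketch: what the latin1 -> binary -> utf8 round trip does to each kind of row.

def latin1_roundtrip_prefix(text: str) -> str:
    raw = text.encode("cp1252")            # "convert back to latin1"
    try:
        return raw.decode("utf-8")         # "convert to utf8"
    except UnicodeDecodeError as exc:
        # Keep only the part before the first invalid byte.
        return raw[: exc.start].decode("utf-8")

print(repr(latin1_roundtrip_prefix("les Ã©tudiantes")))  # broken row
print(repr(latin1_roundtrip_prefix("les étudiantes")))   # already correct row
```

In the correct row, "é" becomes the single latin1 byte 0xE9. Read as UTF-8, 0xE9 announces a three-byte sequence, but the following bytes are plain ASCII, so the sequence is invalid at exactly the position of the first multi-byte character.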

Answer 14:

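There is a nice script to automate the conversion process for an entire database. It is also useful to know that MySQL's utf8 implementation is incomplete, because it only supports UTF-8 characters of up to 3 bytes. The solution is to use the utf8mb4 character set, introduced in MySQL 5.5.3.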
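To check ahead of time whether a value would survive in a 3-byte utf8 column, one can count the encoded bytes per character. A small sketch (`fits_in_utf8mb3` is a hypothetical name):

```python
# Sketch: does every character fit in MySQL's 3-byte "utf8" (utf8mb3)?

def fits_in_utf8mb3(text: str) -> bool:
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)

for sample in ["señor", "5 €", "日本語", "ok 😀"]:
    print(sample, fits_in_utf8mb3(sample))
```

Only the last sample fails: the emoji lies outside the Basic Multilingual Plane and needs four bytes, which is the case utf8mb4 was added for.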

Answer 15:

This is an extension of @Thales Ceolin's answer, to modify every table in the DB:

select concat(
    "update ", 
    a.TABLE_NAME, 
    " set ", b.COLUMN_NAME, 
    " = CONVERT(BINARY CONVERT(", 
    b.COLUMN_NAME, 
    " USING latin1) USING utf8) where ",
    b.COLUMN_NAME, 
    " is not null;") query
from INFORMATION_SCHEMA.TABLES a
left join INFORMATION_SCHEMA.COLUMNS b on a.TABLE_SCHEMA = b.TABLE_SCHEMA and a.TABLE_NAME = b.TABLE_NAME
where a.table_schema = 'db_name'
and a.TABLE_TYPE = 'BASE TABLE'
and b.data_type in ('text', 'varchar')
and a.TABLE_NAME = 'table_name';

This produces:

update table_name set idn = CONVERT(BINARY CONVERT(idn USING latin1) USING utf8) where idn is not null;
update table_name set name = CONVERT(BINARY CONVERT(name USING latin1) USING utf8) where name is not null;
update table_name set primary_last_name = CONVERT(BINARY CONVERT(primary_last_name USING latin1) USING utf8) where primary_last_name is not null;
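The same statements can be generated from a script when the list of columns is already known. This is only a sketch: `build_updates` is a hypothetical helper, and the backtick quoting of identifiers is an addition that the query above does not do.

```python
# Sketch: build one UPDATE per (table, column) pair.

TEMPLATE = (
    "UPDATE `{table}` SET `{column}` = "
    "CONVERT(BINARY CONVERT(`{column}` USING latin1) USING utf8) "
    "WHERE `{column}` IS NOT NULL;"
)

def build_updates(columns: list[tuple[str, str]]) -> list[str]:
    return [TEMPLATE.format(table=t, column=c) for t, c in columns]

for statement in build_updates([("table_name", "idn"), ("table_name", "name")]):
    print(statement)
```

As with the generated SQL above, review the output before running it, because the conversion damages columns that are already correctly encoded.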

Answer 16:

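Since the main question was about detecting the broken characters, here is my solution (it also avoids converting data that is already in the normal charset a second time):

Detection (latin1 to utf8)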

SELECT name FROM %table% 
 WHERE 
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);

Update (latin1 to utf8)

UPDATE %table% SET name = convert(cast(convert(name using latin1 ) as binary) using utf8 )
 WHERE 
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);
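The idea of converting only when the round trip actually changes something can be sketched outside MySQL as well. This is an approximation in Python, not a translation of the query: `looks_double_encoded` is a hypothetical name, and cp1252 again stands in for MySQL's latin1.

```python
# Sketch: treat a value as double-encoded only if the latin1 round trip
# succeeds AND produces a different string.

def looks_double_encoded(text: str) -> bool:
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin1, or not valid UTF-8 afterwards:
        # the value was not double-encoded, so leave it alone.
        return False
    return repaired != text

for value in ["Jose", "José", "JosÃ©", "日本語"]:
    print(value, looks_double_encoded(value))
```

Only "JosÃ©" is reported. Plain ASCII survives the round trip unchanged, correctly encoded "José" fails the UTF-8 step, and "日本語" cannot be expressed in latin1 at all, so none of those rows would be touched.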

Source: Detecting utf8 broken characters in MySQL