在MySQL中检测UTF8破字(Detecting utf8 broken characters i

2019-07-19 14:50发布

我得和一群分散在多个表坏UTF8字符的数据库。 人物的名单是不是很广泛AFAIK(áéíúóÁÉÍÓÚÑñ)

固定给定的表是非常简单的

update orderItem set itemName=replace(itemName,'á','á');

但我不能得到检测断字的方式。 如果我这样做

SELECT * FROM TABLE WHERE field LIKE "%Ã%";

我得到的,因为核对(A = A)的几乎所有领域。 所有破碎的字符到目前为止开始以“A”。 该数据库是在西班牙,因此不使用该特定字符

到目前为止,我已经得到了打破字符的列表

á = á
é = é
í- = í
ó = ó
ñ = ñ
á = Á

如何任何想法做出选择此项工作打算? (二进制搜索或类似的东西)

Answer 1:

如何采用不同的方法,即列来回转换,以获得正确的字符集? 你可以把它转换为二进制,然后为UTF-8,然后为ISO-8859-1或其他任何你正在使用。 请参阅手册的细节。



Answer 2:

我用固定

UPDATE wp_zcs9ck_posts_copy SET post_title = 
    CONVERT(BINARY CONVERT(post_title USING latin1) USING utf8);

完整的解决方案: http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/



Answer 3:

UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'á','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ä','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'é','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ó','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ú','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ñ','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'–','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'’','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'…','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'–','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'“','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‘','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'•','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');


Answer 4:

无文本替换是一个通用的解决方案,因为你可以忘记一些字符。 对于双转换后的文字更合适的解决方法是:

  1. 转换回LATIN1
  2. 转换为二进制
  3. 转换为UTF8

像这样:

alter table descriptions modify name VARCHAR(2000) character set latin1;
alter table descriptions modify name blob;
alter table descriptions modify name VARCHAR(2000) character set utf8;


Answer 5:

谢谢您的回答!

我修好了我这个表,并希望分享变化的完整列表。 请注意,它也包括固定的HTML解码后的字符,除了拉丁美洲的人,这真是一个烂摊子:

update `table` set `field` = replace(`field` ,'É','É');
update `table` set `field` = replace(`field` ,'“','"');
update `table` set `field` = replace(`field` ,'â€','"');
update `table` set `field` = replace(`field` ,'Ç','Ç');
update `table` set `field` = replace(`field` ,'Ã','Ã');
//Edit by slash4
update `table` set `field` = replace(`field` ,'Ã ','À');
update `table` set `field` = replace(`field` ,'ú','ú');
update `table` set `field` = replace(`field` ,'•','-');
update `table` set `field` = replace(`field` ,'Ø','Ø');
update `table` set `field` = replace(`field` ,'õ','õ');
-- The next one  appears to be missing a character. But which one?
update `table` set `field` = replace(`field` ,'í','í');
update `table` set `field` = replace(`field` ,'â','â');
update `table` set `field` = replace(`field` ,'ã','ã');
update `table` set `field` = replace(`field` ,'ê','ê');
update `table` set `field` = replace(`field` ,'á','á');
update `table` set `field` = replace(`field` ,'é','é');
update `table` set `field` = replace(`field` ,'ó','ó');
update `table` set `field` = replace(`field` ,'–','–');
update `table` set `field` = replace(`field` ,'ç','ç');
update `table` set `field` = replace(`field` ,'ª','ª');
update `table` set `field` = replace(`field` ,'º','º');
update `table` set `field` = replace(`field` ,'à','à');
update `table` set `field` = replace(`field` ,'ç','ç');
update `table` set `field` = replace(`field` ,'ã','ã');
update `table` set `field` = replace(`field` ,'á','á');
update `table` set `field` = replace(`field` ,'â','â');
update `table` set `field` = replace(`field` ,'é','é');
update `table` set `field` = replace(`field` ,'í','í');
update `table` set `field` = replace(`field` ,'õ','õ');
update `table` set `field` = replace(`field` ,'ú','ú');
update `table` set `field` = replace(`field` ,'ç','ç');
update `table` set `field` = replace(`field` ,'Á','Á');
update `table` set `field` = replace(`field` ,'Â','Â');
update `table` set `field` = replace(`field` ,'É','É');
update `table` set `field` = replace(`field` ,'Í','Í');
update `table` set `field` = replace(`field` ,'Õ','Õ');
update `table` set `field` = replace(`field` ,'Ú','Ú');
update `table` set `field` = replace(`field` ,'Ç','Ç');
update `table` set `field` = replace(`field` ,'Ã','Ã');
update `table` set `field` = replace(`field` ,'À','À');
update `table` set `field` = replace(`field` ,'Ê','Ê');
update `table` set `field` = replace(`field` ,'Ó','Ó');
update `table` set `field` = replace(`field` ,'Ô','Ô');
update `table` set `field` = replace(`field` ,'Ü','Ü');
update `table` set `field` = replace(`field` ,'ã','ã');
update `table` set `field` = replace(`field` ,'à','à');
update `table` set `field` = replace(`field` ,'ê','ê');
update `table` set `field` = replace(`field` ,'ó','ó');
update `table` set `field` = replace(`field` ,'ô','ô');
update `table` set `field` = replace(`field` ,'ü','ü');
update `table` set `field` = replace(`field` ,'&','&');
update `table` set `field` = replace(`field` ,'>','>');
update `table` set `field` = replace(`field` ,'&lt;','<');
update `table` set `field` = replace(`field` ,'&circ;','ˆ');
update `table` set `field` = replace(`field` ,'&tilde;','˜');
update `table` set `field` = replace(`field` ,'&uml;','¨');
update `table` set `field` = replace(`field` ,'&cute;','´');
update `table` set `field` = replace(`field` ,'&cedil;','¸');
update `table` set `field` = replace(`field` ,'&quot;','"');
update `table` set `field` = replace(`field` ,'&ldquo;','“');
update `table` set `field` = replace(`field` ,'&rdquo;','”');
update `table` set `field` = replace(`field` ,'&lsquo;','‘');
update `table` set `field` = replace(`field` ,'&rsquo;','’');
update `table` set `field` = replace(`field` ,'&lsaquo;','‹');
update `table` set `field` = replace(`field` ,'&rsaquo;','›');
update `table` set `field` = replace(`field` ,'&laquo;','«');
update `table` set `field` = replace(`field` ,'&raquo;','»');
update `table` set `field` = replace(`field` ,'&ordm;','º');
update `table` set `field` = replace(`field` ,'&ordf;','ª');
update `table` set `field` = replace(`field` ,'&ndash;','–');
update `table` set `field` = replace(`field` ,'&mdash;','—');
update `table` set `field` = replace(`field` ,'&macr;','¯');
update `table` set `field` = replace(`field` ,'&hellip;','…');
update `table` set `field` = replace(`field` ,'&brvbar;','¦');
update `table` set `field` = replace(`field` ,'&bull;','•');
update `table` set `field` = replace(`field` ,'&para;','¶');
update `table` set `field` = replace(`field` ,'&sect;','§');
update `table` set `field` = replace(`field` ,'&sup1;','¹');
update `table` set `field` = replace(`field` ,'&sup2;','²');
update `table` set `field` = replace(`field` ,'&sup3;','³');
update `table` set `field` = replace(`field` ,'&frac12;','½');
update `table` set `field` = replace(`field` ,'&frac14;','¼');
update `table` set `field` = replace(`field` ,'&frac34;','¾');
update `table` set `field` = replace(`field` ,'&#8539;','⅛');
update `table` set `field` = replace(`field` ,'&#8540;','⅜');
update `table` set `field` = replace(`field` ,'&#8541;','⅝');
update `table` set `field` = replace(`field` ,'&#8542;','⅞');
update `table` set `field` = replace(`field` ,'&gt;','>');
update `table` set `field` = replace(`field` ,'&lt;','<');
update `table` set `field` = replace(`field` ,'&plusmn;','±');
update `table` set `field` = replace(`field` ,'&minus;','−');
update `table` set `field` = replace(`field` ,'&times;','×');
update `table` set `field` = replace(`field` ,'&divide;','÷');
update `table` set `field` = replace(`field` ,'&lowast;','∗');
update `table` set `field` = replace(`field` ,'&frasl;','⁄');
update `table` set `field` = replace(`field` ,'&permil;','‰');
update `table` set `field` = replace(`field` ,'&int;','∫');
update `table` set `field` = replace(`field` ,'&sum;','∑');
update `table` set `field` = replace(`field` ,'&prod;','∏');
update `table` set `field` = replace(`field` ,'&radic;','√');
update `table` set `field` = replace(`field` ,'&infin;','∞');
update `table` set `field` = replace(`field` ,'&asymp;','≈');
update `table` set `field` = replace(`field` ,'&cong;','≅');
update `table` set `field` = replace(`field` ,'&prop;','∝');
update `table` set `field` = replace(`field` ,'&equiv;','≡');
update `table` set `field` = replace(`field` ,'&ne;','≠');
update `table` set `field` = replace(`field` ,'&le;','≤');
update `table` set `field` = replace(`field` ,'&ge;','≥');
update `table` set `field` = replace(`field` ,'&there4;','∴');
update `table` set `field` = replace(`field` ,'&sdot;','⋅');
update `table` set `field` = replace(`field` ,'&middot;','·');
update `table` set `field` = replace(`field` ,'&part;','∂');
update `table` set `field` = replace(`field` ,'&image;','ℑ');
update `table` set `field` = replace(`field` ,'&real;','ℜ');
update `table` set `field` = replace(`field` ,'&prime;','′');
update `table` set `field` = replace(`field` ,'&Prime;','″');
update `table` set `field` = replace(`field` ,'&deg;','°');
update `table` set `field` = replace(`field` ,'&ang;','∠');
update `table` set `field` = replace(`field` ,'&perp;','⊥');
update `table` set `field` = replace(`field` ,'&nabla;','∇');
update `table` set `field` = replace(`field` ,'&oplus;','⊕');
update `table` set `field` = replace(`field` ,'&otimes;','⊗');
update `table` set `field` = replace(`field` ,'&alefsym;','ℵ');
update `table` set `field` = replace(`field` ,'&oslash;','ø');
update `table` set `field` = replace(`field` ,'&Oslash;','Ø');
update `table` set `field` = replace(`field` ,'&isin;','∈');
update `table` set `field` = replace(`field` ,'&notin;','∉');
update `table` set `field` = replace(`field` ,'&cap;','∩');
update `table` set `field` = replace(`field` ,'&cup;','∪');
update `table` set `field` = replace(`field` ,'&sub;','⊂');
update `table` set `field` = replace(`field` ,'&sup;','⊃');
update `table` set `field` = replace(`field` ,'&sube;','⊆');
update `table` set `field` = replace(`field` ,'&supe;','⊇');
update `table` set `field` = replace(`field` ,'&exist;','∃');
update `table` set `field` = replace(`field` ,'&forall;','∀');
update `table` set `field` = replace(`field` ,'&empty;','∅');
update `table` set `field` = replace(`field` ,'&not;','¬');
update `table` set `field` = replace(`field` ,'&and;','∧');
update `table` set `field` = replace(`field` ,'&or;','∨');
update `table` set `field` = replace(`field` ,'&crarr;','↵');


Answer 6:

SELECT需要声明如下:

SELECT * FROM TABLE WHERE LENGTH(name) != CHAR_LENGTH(name);

这将返回包含多字节字符的所有行。

name被认为是一个字段/哪里奇怪的字符会被发现场。 *



Answer 7:

这救了我的命

UPDATE ohp_posts SET post_content = CONVERT(CAST(CONVERT(post_content USING latin1) AS BINARY) USING utf8)

我发现在这里http://stanis.net/2014/04/replacing-latin-1-with-utf-8-characters-in-mysql/



Answer 8:

我有同样的问题,但并没有像更换()解决方案,因为总有一些丢失字符的可能性。 我正在对混合数据的列(一些已经函数utf8_encode()d和一些不)400万行左右,约25万条记录与错误编码数据(与‰/等字符),占地约15种国际语言,主要包括欧洲语言,而且俄罗斯,日本和中国。

我开始通过复制列,因为我不想丢失任何数据:

ALTER TABLE images ADD COLUMN reptitle TEXT;

复制的所有具有多字节字符(感谢亚当的尖端)的数据

UPDATE images SET reptitle = title WHERE LENGTH(title) != CHAR_LENGTH(title)

由于reptitle与表的默认字符集创建它已经是UTF8,但包含损坏的数据,因为图像表曾经是一个拉丁来源。 列reptitle现在包含一些数据是正确编码,以及一些损坏的(所有带多字节字符值,一些已正确函数utf8_encode()d。所以后来与大卫的提示...

ALTER TABLE images MODIFY reptitle TEXT character set latin1;
ALTER TABLE images MODIFY reptitle BLOB;
ALTER TABLE images MODIFY reptitle TEXT character set utf8;

因为TEXT和BLOB(我认为)是相同的中间步骤可不能是必要的。 这不得不修正所有错误编码数据(成为“étudiantes”“A©tudiantes”等)(成为“拉平去P”“落聘德帕克”)的影响,但是这在以前是正确的,在第一个多字节字符被截断的数据。 我不知道为什么截断,但它在一次性柱,所以我也没在意。 截掉的数据给出了CHAR_LENGTH和相同的价值观的长度,因为没有剩余的那么容易查询多字节字符...

UPDATE images SET title = reptitle WHERE LENGTH(reptitle)!=CHAR_LENGTH(reptitle)

然后,当然刚落备用列

ALTER TABLE images DROP COLUMN reptitle

另外,还要确保(因为我使用PHP,这绊倒了我几次,所以我想我会在这里提到它),你的所有脚本文件是UTF8(无BOM),并且使用:

mysql_set_charset('utf8', $connection);

等瞧...完美修复的数据,所有的语言:)



Answer 9:

除了劳尔·阿维拉索拉诺和acseven的答案,如果你要更新一个查询就可以完成所有的碎字符

update `table` set field = replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(field,'&uuml;','ü'),'&ocirc;','ô'),'&oacute;','ó'),'&ecirc;','ê'),'&agrave;','à'),'&atilde;','ã'),'&Uuml;','Ü'),'&Ocirc;','Ô'),'&Oacute;','Ó'),'&Ecirc;','Ê'),'&Agrave;','À'),'&Atilde;','Ã'),'&Ccedil;','Ç'),'&Uacute;','Ú'),'&Otilde;','Õ'),'&Iacute;','Í'),'&Iacute;','Í'),'&Eacute;','É'),'&Acirc;','Â'),'&Aacute;','Á'),'&ccedil;','ç'),'&uacute;','ú'),'&otilde;','õ'),'&iacute;','í'),'&eacute;','é'),'&acirc;','â'),'&aacute;','á'),'&atilde;','ã'),'&ccedil;','ç'),'à ','à'),'à ','à'),'º','º'),'ª','ª'),'ç','ç'),'–','–'),'ó','ó'),'é','é'),'á','á'),'ê','ê'),'ã','ã'),'â','â'),'í','í'),'õ','õ'),'Ø','Ø'),'•','-'),'ú','ú'),'à ','À'),'Ã','Ã'),'Ç','Ç'),'â€','"'),'“','"'),'É','É');


Answer 10:

这也解决了我的问题对一些意大利字符

UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'á','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ä','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'é','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ó','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ú','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ñ','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'–','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'’','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'…','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'–','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'“','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‘','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'•','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í ','à');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¨','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íˆ','È');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'€','€');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'eÌ€','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í²','ò');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¹','ù');


Answer 11:

你可能有行与正确编码UTF8与编码错误的字符。 在这种情况下,“转换(二进制格式转换(POST_TITLE使用LATIN1)用UTF8)”将削减一些领域。

最后我做这种方式

update `table` set `name` = replace(`name` ,CONVERT(BINARY "ä" USING latin1),'ä');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ö" USING latin1),'ö');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ü" USING latin1),'ü');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ä" USING latin1),'Ä');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ö" USING latin1),'Ö');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ü" USING latin1),'Ü');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ß" USING latin1),'ß');


Answer 12:

基于在这个岗位数据https://www.i18nqa.com/debug/utf8-debug.html我建议这是识别狡猾条目和可能正确的值的一个很好的查询:

SELECT my_field,CONVERT(BINARY CONVERT(my_field USING latin1) USING utf8mb4) AS new_field_value FROM my_table WHERE my_field REGEXP '[âÆËÅÂÃ]';

要非常小心,因为我们有一个文件名的坏编码,但路径的确定编码,并且在这种情况下,一些解决方案上面会造成痛苦的世界。 如果你的一些数据已经在正确的utf-8编码,你可能会发现你失去的是一大块。



Answer 13:

由于中间步骤可能没有必要TEXTBLOB是相同的。

这不得不校正所有错误的编码数据,但是这在以前是正确的第一多字节字符被截断数据的效果。



Answer 14:

有一个很好的脚本来自动转换过程在整个数据库。 这也是有用的知道,MySQL的UTF-8的实现是不完整的,因为它仅支持UTF-8字符最多3个字节。 该解决方案是使用在MySQL 5.5.3推出了utf8mb4字符集。



Answer 15:

这是@Thales Ceolin的,以修改每个表在DB答案的扩展:

select concat(
    "update ", 
    a.TABLE_NAME, 
    " set ", b.COLUMN_NAME, 
    " = CONVERT(BINARY CONVERT(", 
    b.COLUMN_NAME, 
    " USING latin1) USING utf8) where ",
    b.COLUMN_NAME, 
    " is not null;") query
from INFORMATION_SCHEMA.TABLES a
left join INFORMATION_SCHEMA.COLUMNS b on a.TABLE_NAME = b.TABLE_NAME
where a.table_schema = 'db_name'
and a.TABLE_TYPE = 'BASE TABLE'
and b.data_type in ('text', 'varchar')
and a.TABLE_NAME = 'table_name';

这将导致:

update table_name set idn = CONVERT(BINARY CONVERT(idn USING latin1) USING utf8) where idn is not null;
update table_nameset name = CONVERT(BINARY CONVERT(name USING latin1) USING utf8) where name is not null;
update table_name set primary_last_name = CONVERT(BINARY CONVERT(primary_last_name USING latin1) USING utf8) where primary_last_name is not null;


Answer 16:

作为主要的问题是在检测到断裂字符我的解决方案:(以防止在正常的charset双编码)

  1. 检测(LATIN1为utf8)
SELECT name FROM %table% 
 WHERE 
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);
  1. 更新(LATIN1为utf8)
UPDATE %table% SET name = convert(cast(convert(name using latin1 ) as binary) using utf8 )
 WHERE 
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);


文章来源: Detecting utf8 broken characters in MySQL
标签: mysql utf-8