Fixing encodings

Posted 2020-02-06 04:44

I have ended up with messed up character encodings in one of our mysql columns.

Typically I have

√© instead of é
√∂ instead of ö
√≠ instead of í

and so on...
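For reference, this corruption pattern can be reproduced directly: text that was written as UTF-8 bytes gets read back using the Mac Roman encoding. A minimal Python 3 sketch (`mac_roman` is the codec name in the standard library):

```python
# Reproduce the corruption: correct text, encoded as UTF-8,
# mistakenly decoded as Mac Roman.
for good in ('é', 'ö', 'í'):
    garbled = good.encode('utf-8').decode('mac_roman')
    print(good, '->', garbled)
# é -> √©
# ö -> √∂
# í -> √≠
```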

Fairly certain that someone here will know what happened and how to fix it.

UPDATE: Based on bobince's answer and since I had this data in a file I did the following

#!/usr/bin/env python
import codecs
f = codecs.open('./file.csv', 'r', 'utf-8')
f2 = codecs.open('./file-fixed.csv', 'w', 'utf-8')
for line in f:
    f2.write(line.encode('macroman').decode('utf-8'))
f.close()
f2.close()

after which

load data infile 'file-fixed.csv' 
into table list1 
fields terminated by ',' 
optionally enclosed by '"' 
ignore 1 lines;

properly imported the data.
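In Python 3 the same fix can be written without the codecs module, since open() takes an encoding directly. A sketch (the sample row below is hypothetical, written out only to keep the example self-contained; with real data, point the paths at the actual CSV):

```python
def fix_mojibake_line(line: str) -> str:
    # Re-encode the mojibake as Mac Roman to recover the original
    # UTF-8 bytes, then decode them as the UTF-8 they always were.
    return line.encode('mac_roman').decode('utf-8')

# Hypothetical garbled sample row, so the example runs standalone.
with open('file.csv', 'w', encoding='utf-8') as f:
    f.write('caf√©,M√∂tley\n')

with open('file.csv', encoding='utf-8') as src, \
     open('file-fixed.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(fix_mojibake_line(line))
```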

UPDATE2: Hammerite, just for completeness, here are the requested details...

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

The SHOW CREATE TABLE for the table I am importing to has DEFAULT CHARSET=utf8

EDIT3:

Actually, with the above settings the load didn't do the right thing: I could not compare the result to existing utf8 fields, and my loaded data only looked as if it had been loaded correctly, I assume because the client, connection and results charsets were wrong but matched each other. So I updated the settings to:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

loaded the data again, and this time it imported correctly (comparable with the existing data).

1 Answer

Answered by 一夜七次 · 2020-02-06 04:58

Your text has been encoded to UTF-8 and then re-decoded, erroneously, as Mac Roman.

You won't be able to fix this inside the database, as MySQL doesn't know the Mac Roman encoding. You might write a script to walk each row of each affected table and fix up the text by reversing the encode/decode cycle. Python is a possibility, having a good range of encodings:

>>> print u'√©'.encode('macroman').decode('utf-8')
é
>>> print u'√∂'.encode('macroman').decode('utf-8')
ö
>>> print u'√≠'.encode('macroman').decode('utf-8')
í

Alternatively, if there is no non-ASCII content that is unaffected by this problem, you could export an SQL dump with mysqldump and convert it all at once, either using a script as above or, if you have a text editor (presumably on the Mac) that can do it, by loading the dump as UTF-8 and saving it as Mac Roman. Finally re-import the dump using mysql < dump.sql.
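If some rows might already be correct, one defensive option (a sketch, not from the answer above) is to attempt the reverse round-trip and fall back to the original text when it fails. Note the caveat: genuinely correct text that happens to survive the round-trip would still be converted, so spot-check the results.

```python
def maybe_fix(text: str) -> str:
    # Attempt to reverse the UTF-8-read-as-Mac-Roman corruption.
    # If the text cannot be re-encoded as Mac Roman, or the recovered
    # bytes are not valid UTF-8, assume it was already correct.
    try:
        return text.encode('mac_roman').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(maybe_fix('√©'))  # é  (garbled input is repaired)
print(maybe_fix('é'))   # é  (already-correct input passes through)
```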
