Python output replaces non-ASCII characters with �

Posted 2019-05-07 13:50

Question:

I am using Python 2.7 to read data from a MySQL table. In MySQL the name looks like this:

Garasa, Ángel.

But when I print it in Python the output is

Garasa, �ngel

The character set name in MySQL is utf8. This is my Python code:

# coding: utf-8

import MySQLdb

connection = MySQLdb.connect(host="localhost", user="root", passwd="root", db="jmdb")
cursor = connection.cursor()
cursor.execute("select * from actors where actorid=672462;")
data = cursor.fetchall()
for row in data:
    print "IMDB Name=", row[4]
    wiki = "".join(row[4])
    print wiki

I have tried decoding it, but I get errors such as:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 8: invalid start byte

I have read about decoding and UTF-8 but couldn't find a solution.

Answer 1:

Get the MySQL driver to return Unicode strings instead. That way you don't have to deal with decoding in your own code.

Simply set use_unicode=True in the connection parameters. If the table has been set up with a specific encoding, set the charset parameter accordingly.
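As a rough sketch of what that could look like with the connection from the question: the host, user, password and database values are copied from the question, charset="utf8" is only an assumption based on the character set the asker reported, and fetch_actor_name is a hypothetical helper name used for illustration.

```python
# Connection settings from the question, plus the two options suggested above.
# charset="utf8" is an assumption; use whatever encoding the data really has.
CONNECT_KWARGS = dict(
    host="localhost",
    user="root",
    passwd="root",
    db="jmdb",
    use_unicode=True,  # ask the driver to return text columns as unicode objects
    charset="utf8",    # character set used for the client connection
)


def fetch_actor_name(actor_id):
    """Hypothetical helper: return column 4 of the matching row, or None."""
    # Imported here so the settings above can be inspected without the driver.
    import MySQLdb

    connection = MySQLdb.connect(**CONNECT_KWARGS)
    try:
        cursor = connection.cursor()
        # Pass the id as a query parameter rather than formatting it into the SQL.
        cursor.execute("select * from actors where actorid=%s", (actor_id,))
        row = cursor.fetchone()
        return row[4] if row else None
    finally:
        connection.close()
```

With these settings the value in row[4] should arrive already decoded, so no .decode() call is needed before printing it. If the stored bytes are not actually valid for the declared character set, the name may still come back wrong, which is the case the next answer looks at.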



Answer 2:

I think the right character mapping in your case is cp1252:

>>> s = 'Garasa, Ángel.'
>>> s.decode('utf-8')

Traceback (most recent call last):
  File "<pyshell#63>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 8: invalid start byte

>>> s.decode('cp1252')
u'Garasa, \xc1ngel.'
>>>
>>> print s.decode('cp1252')
Garasa, Ángel.
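If some rows might be real UTF-8 and others cp1252, one possible approach is to try UTF-8 first and fall back only when it fails. This is a minimal sketch, not a general fix: decode_name is a hypothetical helper, and the cp1252 fallback is an assumption based on the session above.

```python
def decode_name(raw, fallback="cp1252"):
    """Decode bytes as UTF-8, falling back to another codec if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A lone byte such as 0xC1 is not valid UTF-8, so use the fallback codec.
        return raw.decode(fallback)


legacy = b"Garasa, \xc1ngel."      # 0xC1 is the single-byte cp1252 code for Á
proper = b"Garasa, \xc3\x81ngel."  # the same name encoded as UTF-8

name_from_legacy = decode_name(legacy)
name_from_proper = decode_name(proper)
```

Both calls give the same unicode string. Trying UTF-8 first is the safer order, because almost any byte string decodes under cp1252 or latin-1 without error, so those codecs cannot be used to detect a wrong guess.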

EDIT: The data could also be latin-1:

>>> s.decode('latin-1')
u'Garasa, \xc1ngel.'
>>> print s.decode('latin-1')
Garasa, Ángel.

Both work here because the cp1252 and latin-1 code pages agree on every code except the range 128 to 159, and 0xC1 (193) lies outside that range.

Quoting from this source (latin-1):

The Windows-1252 codepage coincides with ISO-8859-1 for all codes except the range 128 to 159 (hex 80 to 9F), where the little-used C1 controls are replaced with additional characters including all the missing characters provided by ISO-8859-15

And this one (cp1252):

This character encoding is a superset of ISO 8859-1, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.
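The overlap described in those two quotes can be checked against Python's own codecs. The sketch below decodes each of the 256 byte values both ways and collects the ones where the results differ; it only reflects how Python's cp1252 and latin-1 codecs behave, and the variable names are just for illustration.

```python
differing = []
for value in range(256):
    byte = bytes(bytearray([value]))  # a one-byte string on Python 2 and 3
    latin1_char = byte.decode("latin-1")
    try:
        cp1252_char = byte.decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes in this range undefined.
        cp1252_char = None
    if cp1252_char != latin1_char:
        differing.append(value)

# 0xC1 falls outside the differing range, so both codecs give the same letter.
same_for_c1 = b"\xc1".decode("cp1252") == b"\xc1".decode("latin-1")
```

The bytes that differ are exactly 0x80 to 0x9F, so for a name like this one either codec gives the same result. The choice only matters for characters such as the euro sign or curly quotes, which cp1252 places in that range.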