I am using Python 2.7 to read data from a MySQL table. In MySQL the name looks like this:
Garasa, Ángel.
But when I print it in Python the output is:
Garasa, �ngel
The character set name in MySQL is utf8.
This is my Python code:
# coding: utf-8
import MySQLdb

connection = MySQLdb.connect(host="localhost", user="root", passwd="root", db="jmdb")
cursor = connection.cursor()
cursor.execute("select * from actors where actorid=672462;")
data = cursor.fetchall()
for row in data:
    print "IMDB Name=", row[4]
    wiki = "".join(row[4])
    print wiki
I have tried decoding it, but I get errors such as:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 8: invalid start byte
I have read about decoding and UTF-8 but couldn't find a solution.
Get the MySQL driver to return Unicode strings instead, so that you don't have to deal with decoding in your code.
Simply set use_unicode=True in the connection parameters. If the table has been set with a specific encoding, set the charset parameter accordingly.
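As a minimal sketch of that suggestion, the snippet below builds the keyword arguments for MySQLdb.connect with use_unicode=True and an explicit charset. The helper names unicode_connect_kwargs and open_connection are hypothetical, the credentials are the ones from the question, and charset="utf8" is only an assumption based on the character set reported there; it has to match the encoding the data is actually stored in.

```python
def unicode_connect_kwargs(host, user, passwd, db, charset="utf8"):
    """Build MySQLdb.connect() keyword arguments that request unicode results."""
    return {
        "host": host,
        "user": user,
        "passwd": passwd,
        "db": db,
        "charset": charset,    # assumption: must match the data's real encoding
        "use_unicode": True,   # ask the driver to decode text columns itself
    }


def open_connection(**kwargs):
    # Imported lazily so the helper above works without the driver installed.
    import MySQLdb
    return MySQLdb.connect(**kwargs)


params = unicode_connect_kwargs("localhost", "root", "root", "jmdb")
# connection = open_connection(**params)  # requires a running MySQL server
```

With a connection opened this way, text columns such as row[4] should come back as unicode objects rather than raw byte strings.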
I think the right character mapping in your case is cp1252:
>>> s = 'Garasa, Ángel.'
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#63>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 8: invalid start byte
>>> s.decode('cp1252')
u'Garasa, \xc1ngel.'
>>>
>>> print s.decode('cp1252')
Garasa, Ángel.
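If the driver keeps returning byte strings and you are not sure which encoding they use, one possible approach is to try UTF-8 first and fall back to cp1252. This is only a sketch: decode_with_fallback is a hypothetical helper, the order of encodings is an assumption, and the byte string is written with an explicit \xc1 escape so the result does not depend on the console encoding.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return raw decoded with the first encoding that accepts it."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("none of the encodings could decode the data")


raw_name = b"Garasa, \xc1ngel."             # byte 0xc1 is not valid UTF-8
name = decode_with_fallback(raw_name)       # decoded via the cp1252 fallback
utf8_name = decode_with_fallback(b"Garasa, \xc3\x81ngel.")  # valid UTF-8 input
```

Both calls produce the same unicode string, because 0xc3 0x81 is the UTF-8 encoding of the character that cp1252 stores as the single byte 0xc1.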
EDIT: It could also be latin-1:
>>> s.decode('latin-1')
u'Garasa, \xc1ngel.'
>>> print s.decode('latin-1')
Garasa, Ángel.
Both work because the cp1252 and latin-1 code pages coincide for all codes except the range 128 to 159, and the byte 0xc1 (193) lies outside that range.
Quoting from this source (latin-1):
The Windows-1252 codepage coincides with ISO-8859-1 for all codes
except the range 128 to 159 (hex 80 to 9F), where the little-used C1
controls are replaced with additional characters including all the
missing characters provided by ISO-8859-15
And this one (cp1252):
This character encoding is a superset of ISO 8859-1, but differs from
the IANA's ISO-8859-1 by using displayable characters rather than
control characters in the 80 to 9F (hex) range.
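The overlap described in these quotes can be checked directly. The sketch below decodes every single byte value with both codecs and collects the values where they disagree; differing_bytes is a hypothetical helper name, and byte values that Python's cp1252 codec leaves undefined are counted as differences.

```python
def differing_bytes():
    """Return the byte values that cp1252 and latin-1 decode differently."""
    diffs = []
    for code in range(256):
        raw = bytes(bytearray([code]))   # one-byte string on Python 2 and 3
        latin = raw.decode("latin-1")    # latin-1 accepts every byte value
        try:
            win = raw.decode("cp1252")
        except UnicodeDecodeError:
            win = None                   # undefined in Python's cp1252 codec
        if win != latin:
            diffs.append(code)
    return diffs


diffs = differing_bytes()
same_for_c1 = 0xC1 not in diffs   # True when both codecs agree on byte 0xc1
```

With Python's built-in codecs, every value in diffs falls in the range 128 to 159, which is why the name from the question decodes identically under both encodings.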