I have a SQL query that I execute like this with an SQLAlchemy engine:
result = engine.execute('SELECT utf_8_field FROM table')
The database is MySQL and the column type is TEXT with UTF-8 encoding. The type of the returned utf_8_field is "str", even if I set the option convert_unicode=True when creating the engine. What happens now is that if I have a character like 'é' in my string (which is not in 7-bit ASCII, but is in the extended ASCII set), I get a UnicodeDecodeError when trying to execute this:
utf_8_field.encode("utf-8")
The exact error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)
When looking into this, I found that str.encode do not support the extended ASCII character set! I find this really strange, but that's another question.
What I don't understand is why SQLAlchemy is not giving me a unicode string. I was previously using DB-API and that was working fine. I also don't have SQLAlchemy table objects for my tables yet, that's why I'm using an execute command.
Any idea?
If you want the data converted automatically, you should specify the charset when you create the engine:
create_engine('mysql+mysqldb:///mydb?charset=utf8')
Setting use_unicode
alone won't tell sqlalchemy which charset to use.
To convert from an UTF-8 bytestring to a unicode object, you need to decode:
utf_8_field.decode('utf8')
Also, when executing a raw SELECT
through .execute
, SQLAlchemy has no metadata to work out that your query is returning utf-8 data, so it is not converting this information to unicode for you.
In other words, convert_unicode
only works if you use the SQLAlchemy SQL expression API or the ORM functionality.
EDIT: As pointed out, your data is not even UTF-8 encoded; 0xe9
in UTF-8 would indicate a character between \u9000
and \u9fff
, which are CJK unified ideographs while you said it was a latin-1 character, whose UTF-8 code would start with 0xc3
. This is probably ISO-8859-1
(latin-1) or similar instead:
>>> u'é'.encode('ISO-8859-1')
'\xe9'
The conclusion then is to tell SQLAlchemy to connect with a different character set, using the charset=utf8
parameter, as pointed out by @mata.