I'm having trouble storing and outputting an ndash character as UTF-8 in Django.
I'm getting data from an API. In raw form, as retrieved and viewed in a text editor, a given unit of data may look like:
"I love this detergent \u2013 it is so inspiring."
(\u2013 is &ndash; as an HTML entity).
If I get this straight from an API and display it in Django, no problem. It displays in my browser as a long dash. I noticed I have to do decode('utf-8') to avoid the "'ascii' codec can't encode character" error if I try to do some operations with that text in my view, though. The text is going to the template as "I love this detergent\u2013 it is so inspiring.", according to the Django Debug Toolbar.
When stored to MySQL and read for output through the same view and template, however, it ends up looking like
"I love this detergent â€“ it is so inspiring"
My MySQL table is set to DEFAULT CHARSET=utf8.
Now, when I read the data from the database through the MySQL monitor in a terminal set to UTF-8, it shows up as
"I love this detergent – it is so inspiring"
(correct - shows an ndash)
When I use MySQLdb in a Python shell, this line is
"I love this detergent \xe2\x80\x93 it is so inspiring"
(this is the correct UTF-8 for an ndash)
However, if I run python manage.py shell, and then
In [1]: from myproject.myapp.models import ThatTable
In [2]: msg=ThatTable.objects.all().filter(thefield__contains='detergent')
In [3]: msg
Out[3]: [{'thefield': 'I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring'}]
It appears to me that Django has taken \xe2\x80\x93 to mean three separate characters, and encoded each of them as UTF-8, giving \xc3\xa2\xe2\x82\xac\xe2\x80\x9c. This displays as â€“ because \xe2 appears to be â, \x80 appears to be €, etc. I've checked and this is how it's being sent to the template, as well.
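To make that theory concrete, here is a small sketch that reproduces the same bytes in plain Python, with no Django or MySQL involved. The helper name double_encode is made up for illustration, and the cp1252 step is only my guess at which single-byte charset the UTF-8 bytes are being misread as; I haven't confirmed where in the stack that happens.

```python
# Sketch of the suspected double encoding.
# Assumption: the UTF-8 bytes are misread as cp1252 somewhere between
# MySQL and Django, and the resulting characters are encoded as UTF-8 again.

def double_encode(text, wrong_charset='cp1252'):
    """Encode text as UTF-8, misread those bytes as wrong_charset, re-encode as UTF-8."""
    utf8_bytes = text.encode('utf-8')            # correct UTF-8 bytes
    misread = utf8_bytes.decode(wrong_charset)   # each byte becomes its own character
    return misread.encode('utf-8')               # second, unwanted UTF-8 encoding

ndash = u'\u2013'
correct = ndash.encode('utf-8')
mangled = double_encode(ndash)

print(repr(correct))   # the three bytes MySQLdb shows me
print(repr(mangled))   # the seven bytes the Django shell shows me
```

The second value comes out as \xc3\xa2\xe2\x82\xac\xe2\x80\x9c, the same sequence I see from the Django shell, which is why I suspect a charset mismatch rather than bad data in the table.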
If you decode the long sequence in Python, though, with decode('utf-8'), the result is \xe2\u20ac\u201c, which also renders in the browser as â€“. Trying to decode it again yields a UnicodeDecodeError.
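For what it's worth, the mangled value can be turned back into the real text by hand, which is part of why I think the stored data is fine and only the read path is wrong. This is a minimal sketch, assuming the wrong charset involved is cp1252; the function name undo_double_encoding is hypothetical, and I'd treat it as a diagnostic rather than a fix.

```python
# Diagnostic sketch: reverse a suspected UTF-8 -> cp1252 -> UTF-8 round trip.
# Assumption: cp1252 is the charset the bytes were misread as.

def undo_double_encoding(raw_bytes, wrong_charset='cp1252'):
    """Undo one layer of double encoding and return the intended text."""
    once = raw_bytes.decode('utf-8')              # gives the a-circumflex, euro, quote trio
    original_bytes = once.encode(wrong_charset)   # back to the real UTF-8 bytes
    return original_bytes.decode('utf-8')         # finally the intended characters

mangled = b'I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring'
repaired = undo_double_encoding(mangled)

print(repr(repaired))
```

The repaired string contains a single \u2013 again, so no information seems to be lost along the way.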
I've followed the Django suggestions for Unicode, as far as I know (configured MySQL).
Any suggestions on what I may have misconfigured?
Addendum: It seems this same issue has cropped up in other areas or systems as well. While searching for \xc3\xa2\xe2\x82\xac\xe2\x80\x9c, I found at http://pastie.org/908443.txt a script to 'repair bad UTF8 entities.', also found in a WordPress RSS import plugin. It simply replaces this sequence with –. I'd like to solve this the right way, though!
Oh, and I'm using Django 1.2 and Python 2.6.5.
I can connect to the same database with PHP/PDO and print out this data without doing anything special, and it looks fine.