I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.
I am trying to print a string, but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr(), this is what I get: u'Von D\xc3\xbc' and u'\xc3\x96berg'
Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not ignored, as they would be with e.g. myStr.encode("ascii", "ignore").
EDIT
This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>) are put into the variable name. This is the variable which contains special characters that I cannot print.
web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over <td> to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0:  # td contains the time
            time = remove_whitespace(td.get_text())
        else:  # td contains the name
            name = remove_whitespace(td.get_text())  # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:
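(The snippet that belonged here is missing from this copy. As a rough sketch of the kind of mistake that produces strings like the ones in the question, assuming the page was really UTF-8 and got decoded as latin1 somewhere along the way; the variable names are made up for illustration:)

```python
# Assumption: the bytes on the wire were UTF-8, but were decoded as latin1.
raw_bytes = b"Von D\xc3\xbc"         # UTF-8 encoding of u"Von D\xfc" ("Von Dü")

wrong = raw_bytes.decode("latin1")   # every byte becomes its own character
right = raw_bytes.decode("utf8")     # the two bytes \xc3\xbc become one character
```

Here repr(wrong) shows the same \xc3\xbc pair as in the question, which is why a wrong decode is the likely suspect.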
The cure is simply to reverse the process (encode back to bytes with the codec that was wrongly used), and then decode with the correct one.
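A minimal sketch of that reversal, assuming the wrong codec was latin1 and the right one is UTF-8 (fix_mojibake is just a hypothetical helper name, not something from a library):

```python
def fix_mojibake(text):
    """Reverse a wrong latin1 decode, then decode the bytes as UTF-8."""
    return text.encode("latin1").decode("utf8")


# The two strings from the question, as shown by repr()
fixed_names = [fix_mojibake(s) for s in (u"Von D\xc3\xbc", u"\xc3\x96berg")]
```

Note that this raises UnicodeEncodeError if the text contains characters outside latin1, and UnicodeDecodeError if the resulting bytes are not valid UTF-8, so it only makes sense on strings that were damaged in exactly this way.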
Update: Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url. If you can't show it, read the BS docs; it looks like you'll need to use:
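(The exact call is missing from this copy. One option that bs4 does offer is the from_encoding argument, which tells BeautifulSoup what encoding to assume instead of trusting the page. A sketch, assuming bs4 is installed; soup_from_bytes and the sample markup are made up for illustration:)

```python
def soup_from_bytes(raw_html, encoding="utf-8"):
    """Parse raw bytes, telling BeautifulSoup which encoding to assume."""
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
    return BeautifulSoup(raw_html, "html.parser", from_encoding=encoding)


# Hypothetical page: declares ISO-8859-1 but actually contains UTF-8 bytes.
raw_html = (b'<html><head><meta charset="ISO-8859-1"></head>'
            b"<body><table><tr><td>Von D\xc3\xbc</td></tr></table></body></html>")

as_declared = raw_html.decode("latin1")  # what trusting the declaration gives
as_actual = raw_html.decode("utf-8")     # what the bytes really contain
```

In the question's code this would presumably amount to passing from_encoding="utf-8" in the BeautifulSoup(web) call, but that is a guess until the url is known.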
Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front.
For lots more information:
http://www.joelonsoftware.com/articles/Unicode.html
http://docs.python.org/howto/unicode.html
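To make the point about the u prefix concrete, here is a small sketch (variable names are made up). In Python 2 a plain 'Von D\xc3\xbc' literal without the u is already a byte string; the b prefix used below means the same thing there and also works on Python 3:

```python
as_text = u"Von D\xc3\xbc"    # unicode literal: \xc3 and \xbc are two separate characters
as_bytes = b"Von D\xc3\xbc"   # byte string: \xc3\xbc is the UTF-8 encoding of one character

decoded = as_bytes.decode("utf8")   # u"Von D\xfc", i.e. "Von Dü"
```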
You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:
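(The hack itself is missing from this copy. One hack of that general kind, offered only as a guess at what was meant, is to turn each character's code point back into a raw byte and then decode the result as UTF-8; bytes_from_code_points is a made-up name:)

```python
def bytes_from_code_points(text):
    """Treat each character's code point (must be below 256) as one raw byte."""
    return bytes(bytearray(ord(ch) for ch in text))


s = u"Von D\xc3\xbc"
repaired = bytes_from_code_points(s).decode("utf8")   # u"Von D\xfc"
```

This fails with ValueError as soon as the string contains a character above \xff, which is one more reason to fix the decoding at the source instead.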
The contents of the strings are not unicode, they are UTF-8 encoded.
Edit: