This question already has an answer here:
- Decode HTML entities in Python string?
I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?
For example, I get back:
&#x01ce;
which represents an "ǎ" with a tone mark. In hexadecimal, its 16-bit code point is 01ce. I want to convert the HTML entity into the value u'\u01ce'.
This function should help you get it right and convert entities back to Unicode characters.
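A minimal sketch of such a function, written for Python 3, where `html.entities` and `chr` take the place of Python 2's `htmlentitydefs` and `unichr`. The names `unescape` and `fixup` are illustrative, and leaving unknown entities untouched is an assumption about the desired behaviour:

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace numeric and named HTML entities in text with Unicode characters."""

    def fixup(match):
        entity = match.group(0)
        body = entity[1:-1]  # strip the leading '&' and trailing ';'
        try:
            if body.startswith(("#x", "#X")):
                # Hexadecimal character reference, e.g. &#x01ce;
                return chr(int(body[2:], 16))
            if body.startswith("#"):
                # Decimal character reference, e.g. &#462;
                return chr(int(body[1:]))
            # Named entity, e.g. &amp;
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown or malformed entity: leave it as it was (assumed behaviour)
            return entity

    return re.sub(r"&#?\w+;", fixup, text)
```

Calling `unescape("&#x01ce;")` yields the single character U+01CE.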
The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
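On Python 3.4 and later the same functionality is documented as `html.unescape()`; the `HTMLParser.unescape()` method was deprecated and then removed in Python 3.9. A short sketch, assuming a Python 3 interpreter:

```python
import html

# Decode a hexadecimal character reference into its Unicode character.
decoded = html.unescape("&#x01ce;")

# The result is the single code point U+01CE.
print(decoded == "\u01ce")
```

Named entities such as `&amp;` and decimal references such as `&#462;` are handled by the same call.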
Use the builtin unichr -- BeautifulSoup isn't necessary:
You could find an answer here -- Getting international characters from a web page?
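For the `unichr` suggestion, a rough sketch of decoding numeric entities without BeautifulSoup. It is written for Python 3, where `chr` replaces `unichr`; the function name `decode_numeric_entities` is hypothetical, and named entities such as `&amp;` are deliberately left alone:

```python
import re


def decode_numeric_entities(text):
    """Replace &#NNN; and &#xHHH; references with the characters they name."""

    def to_char(match):
        digits = match.group(1)
        if digits[0] in "xX":
            # Hexadecimal form: drop the 'x' and parse base 16
            return chr(int(digits[1:], 16))
        # Decimal form
        return chr(int(digits))

    # chr() raises ValueError for code points outside the Unicode range;
    # this sketch does not guard against that.
    return re.sub(r"&#([xX][0-9a-fA-F]+|[0-9]+);", to_char, text)
```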
EDIT: It seems like BeautifulSoup doesn't convert entities written in hexadecimal form. It can be fixed:
EDIT: The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.
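One way the hexadecimal workaround from the first EDIT could look is to rewrite hexadecimal references as decimal ones before handing the markup to the parser. This is only a sketch under that assumption; the helper name `hex_entities_to_decimal` is hypothetical and the BeautifulSoup call itself is omitted:

```python
import re


def hex_entities_to_decimal(markup):
    """Rewrite &#xHHHH; references as their decimal &#NNNN; equivalents."""
    return re.sub(
        r"&#[xX]([0-9a-fA-F]+);",
        lambda m: "&#%d;" % int(m.group(1), 16),
        markup,
    )
```

The output still contains entities, so it is meant as a preprocessing step rather than a replacement for unescape().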