I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?
For example:
I get back:
&#x01ce;
which represents an "ǎ" with a tone mark. As a 16-bit value, this is 01ce in hexadecimal. I want to convert the HTML entity into the value u'\u01ce'
Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.
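For context, here is a small sketch of what the module does provide: lookup tables between entity names and code points, but nothing that decodes a whole string. It assumes Python 3, where the module is named html.entities (in Python 2 the same tables live in htmlentitydefs); the variable names are only illustrative.

```python
from html.entities import codepoint2name, name2codepoint

# name2codepoint maps an entity name (without '&' and ';') to its code point.
eacute_codepoint = name2codepoint["eacute"]   # 233, i.e. U+00E9

# chr() turns the code point into the actual character.
eacute_char = chr(eacute_codepoint)

# codepoint2name is the reverse table.
eacute_name = codepoint2name[233]

print(eacute_codepoint, eacute_char, eacute_name)
```

These tables only cover named entities, so numeric references such as &#x01ce; still have to be parsed separately.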
Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:
Not sure why the Stack Overflow thread does not include the ';' in the search/replace (i.e. lambda m: '&#%d;'). If you leave it out, BeautifulSoup can barf because the adjacent character can be interpreted as part of the HTML code (e.g. &#39B for &#39Blackout).
This worked better for me:
html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)
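As a rough illustration of the substitution above, the sketch below wraps it in a helper and runs it on a sample string. The function name hex_refs_to_decimal is made up for this example, and the pattern is narrowed to hex digits (an assumption on my part) so that a malformed reference is left alone instead of raising ValueError in int().

```python
import re


def hex_refs_to_decimal(html_string):
    """Rewrite hex references (&#x27;) as decimal ones (&#39;), keeping the ';'."""
    return re.sub(
        r"&#x([0-9a-fA-F]+);",
        lambda m: "&#%d;" % int(m.group(1), 16),
        html_string,
    )


sample = "&#x27;Blackout&#x27; and &#x01ce;"
converted = hex_refs_to_decimal(sample)
print(converted)  # &#39;Blackout&#39; and &#462;
```

Because the terminating ';' is kept, the 'B' of 'Blackout' can no longer be mistaken for part of the reference.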
If you are on Python 3.4 or newer, you can simply use html.unescape:
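A minimal usage sketch, assuming Python 3.4+; the sample string is made up for illustration and mixes hex, decimal and named references:

```python
import html

# One hex reference, one decimal reference, and two named entities.
raw = "&#x01ce; &#462; &eacute; &amp;"
decoded = html.unescape(raw)

print(decoded)  # ǎ ǎ é &
```

In Python 3 the result is an ordinary str, which is already Unicode, so no separate conversion step is needed.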
An alternative, if you have lxml:
Here is the Python 3 version of dF's answer:
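The original listing is not reproduced here, so what follows is only a sketch of what such a Python 3 function might look like, not necessarily the answer's exact code. The names unescape and fixup are illustrative. It handles decimal, hex and named entities in a single pass and leaves anything it does not recognise unchanged.

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace HTML character references and named entities in text."""

    def fixup(match):
        token = match.group(0)
        if token.startswith("&#"):
            # Numeric reference: hex (&#x01ce;) or decimal (&#462;).
            try:
                if token[2:3] in ("x", "X"):
                    return chr(int(token[3:-1], 16))
                return chr(int(token[2:-1]))
            except (ValueError, OverflowError):
                return token  # not a valid code point; leave as is
        # Named entity such as &amp; or &eacute;.
        try:
            return chr(name2codepoint[token[1:-1]])
        except KeyError:
            return token  # unknown name; leave as is

    return re.sub(r"&#?\w+;", fixup, text)


print(unescape("&#x01ce; &#462; &eacute; &bogus;"))  # ǎ ǎ é &bogus;
```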
The main changes concern htmlentitydefs, which is now html.entities, and unichr, which is now chr. See this Python 3 porting guide.
Another solution is the standard library module xml.sax.saxutils (for both HTML and XML). However, it will convert only &gt;, &amp; and &lt;.
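A short sketch of the xml.sax.saxutils behaviour, assuming Python 3. By default only the three XML entities are decoded and numeric references pass through untouched; anything else has to be supplied yourself through the optional entities dictionary. The sample strings are made up for illustration.

```python
from xml.sax.saxutils import unescape

# Default behaviour: only &lt;, &gt; and &amp; are converted.
basic = unescape("&lt;b&gt; &amp; &#x01ce;")
print(basic)  # <b> & &#x01ce;

# Extra replacements can be passed explicitly as a dict.
extended = unescape("&quot;caf&eacute;&quot;", {"&quot;": '"', "&eacute;": "\u00e9"})
print(extended)  # "café"
```

This makes it a poor fit for scraped pages that contain arbitrary numeric or named entities.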