How to parse HTML with entities such as &nbsp; usi

Posted 2019-01-26 01:58

Sometimes you want to parse reasonably well-formed HTML pages but are reluctant to add an extra library dependency such as BeautifulSoup or lxml. In that case you will probably want to try the built-in ElementTree first: it is part of the standard library, it is fast (implemented in C), and it offers a much better interface (such as XPath support) than the basic HTMLParser, which also has limitations of its own.

ElementTree works until it encounters an entity such as &nbsp;, which it does not handle by default.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''
et = ET.fromstring(html)

Run it on Python 2 or Python 3 and you will see this error:

xml.etree.ElementTree.ParseError: undefined entity: line 7, column 38
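
To tell this failure apart from other parse errors, the exception can be inspected. The following is only a minimal sketch, assuming Python 3.2+ (for xml.parsers.expat.errors.codes); the helper name find_undefined_entity is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import errors


def find_undefined_entity(markup):
    """Return (line, column) of an undefined entity, or None if parsing succeeds.

    Hypothetical helper: any other kind of ParseError is re-raised unchanged.
    """
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        # err.code is the numeric expat error code; errors.codes maps the
        # human-readable message back to that number.
        if err.code == errors.codes[errors.XML_ERROR_UNDEFINED_ENTITY]:
            return err.position
        raise
    return None


print(find_undefined_entity("<p>a&nbsp;b</p>"))  # a (line, column) tuple
print(find_undefined_entity("<p>a&amp;b</p>"))   # None: &amp; is predefined in XML
```

The five entities predefined by XML (&amp;, &lt;, &gt;, &quot;, &apos;) never trigger this error; only HTML-specific names such as &nbsp; do.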

There are some Q&As out there, such as this one and that one. They suggest using ElementTree.XMLParser().parser.UseForeignDTD(True), but I cannot get that to work on Python 3.3 or Python 3.4.

$ python3.3
Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 01:12:57) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> ET.XMLParser().parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'xml.etree.ElementTree.XMLParser' object has no attribute 'parser'
>>> 

2 answers
劳资没心,怎么记你
#2 · 2019-01-26 02:11

Inspired by this post, we can simply prepend a DOCTYPE declaration that defines the missing entity to the incoming raw HTML, and then ElementTree works out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)
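
If more than a handful of entities are needed, the declarations can be generated instead of typed by hand. This is a rough sketch, assuming Python 3 (html.entities; on Python 2 the equivalent module is htmlentitydefs). The names build_doctype and parse_html_fragment are hypothetical helpers, and the shorter DOCTYPE without a public identifier is my own simplification, not what the answer above uses.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML already predefines these; redeclaring lt/amp as plain character
# references would break parsing, so they are skipped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def build_doctype(root_tag="html"):
    """Build a DOCTYPE whose internal subset declares the HTML 4 entities."""
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return "<!DOCTYPE %s [%s]>" % (root_tag, decls)


def parse_html_fragment(markup):
    """Parse well-formed markup after prepending the generated DOCTYPE."""
    return ET.fromstring(build_doctype() + markup)


root = parse_html_fragment("<html><div>a&nbsp;b &copy; &euro;</div></html>")
print(repr(root.find("div").text))
```

The markup must not already start with its own DOCTYPE or XML declaration, since the generated DOCTYPE is simply concatenated in front of it.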
趁早两清
#3 · 2019-01-26 02:21

As an alternative, setting the parser's "entity" attribute worked for me:

parser.entity["nbsp"] = ' '
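
For completeness, here is how that one-liner might fit into a full script. Treat it as a sketch under assumptions: as far as I can tell, expat only defers unknown entities to this lookup table when the document declares a DOCTYPE with an external identifier, so one is prepended here (expat does not fetch the URL). The function name parse_with_entity_table is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a DOCTYPE with an external identifier is needed so that
# undefined entities are looked up in parser.entity instead of failing.
DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)


def parse_with_entity_table(markup, entities):
    """Parse markup, resolving the given entity names via XMLParser.entity."""
    parser = ET.XMLParser()
    parser.entity.update(entities)  # name -> replacement text
    parser.feed(DOCTYPE + markup)
    return parser.close()           # returns the root element


root = parse_with_entity_table(
    "<html><div>a&nbsp;b</div></html>", {"nbsp": "\u00a0"}
)
print(repr(root.find("div").text))
```

Compared with declaring entities in the DOCTYPE itself, this keeps the mapping in a plain Python dict, which may be easier to extend.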