How to parse HTML with entities such as usi

There are times that you want to parse some reasonably well-formed HTML pages, but you are reluctant to introduce extra library dependency such as BeautifulSoup or lxml. So you will probably like to try the builtin ElementTree first, because it is a standard library, it is fast (implemented in C), and it supports much better interface (such as XPATH support) than the basic HTMLParser. Not to mention, HTMLParser has its own limitations.

ElementTree will work, until it encounters some entities, such as  , which are not handled by default.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''
et = ET.fromstring(html)

Run it on Python 2 or Python 3, you will see this error:

xml.etree.ElementTree.ParseError: undefined entity: line 7, column 38

There are some Q&A out there, such as this one and that one. They hint to use ElementTree.XMLParser().parser.UseForeignDTD(True) but I can not get it work in Python 3.3 and Python 3.4.

$ python3.3
Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 01:12:57) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> ET.XMLParser().parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'xml.etree.ElementTree.XMLParser' object has no attribute 'parser'
>>>

标签： python html parsing entity elementtree

2条回答

劳资没心，怎么记你

2楼-- · 2019-01-26 02:11

Inspired by this post, we can just prepend some XML definition to the incoming raw HTML content, and then ElementTree would work out of box.

This works for both Python 2.6, 2.7, 3.3, 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)

0人赞添加讨论(0) 举报

趁早两清

3楼-- · 2019-01-26 02:21

As another alternative answer, setting the attribute "entity" of the parser worked for me:

parser.entity["nbsp"] = ' '

0人赞添加讨论(0) 举报

How to parse HTML with entities such as usi

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间