How can I parse HTML with html5lib, and query the (answers, page 2)

Posted 2020-02-23 06:47

I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this out. The ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

So let's try it:

>>> import html5lib
>>> import lxml.etree
>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

That looks good. Let's see what else we have:

>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

Seriously. I was planning on using some XPath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.
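
A note on why plain XPath seems to find nothing here: in the serialized output above, every element sits in the XHTML namespace (`http://www.w3.org/1999/xhtml`), so an unqualified step such as `tr` matches no element. The sketch below is only an illustration of that point. It re-parses the serialized string from the question with the standard library's `xml.etree.ElementTree` rather than with html5lib itself, and the prefix `h` and the variable names are arbitrary choices made here, not anything from the original post.

```python
import xml.etree.ElementTree as ET

# Serialized tree copied from the output above (what html5lib produced).
serialized = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/>'
    '<html:body><html:table><html:tbody>'
    '<html:tr><html:td>Header</html:td></html:tr>'
    '<html:tr><html:td>Want This</html:td></html:tr>'
    '</html:tbody></html:table></html:body></html:html>'
)
root = ET.fromstring(serialized)

# The prefix "h" is an arbitrary name chosen for this sketch;
# only the namespace URI has to match the document.
ns = {"h": "http://www.w3.org/1999/xhtml"}

# An unqualified path finds nothing, because every element is namespaced.
unqualified = root.findall(".//tr")
print(unqualified)  # []

# Qualifying each step with the prefix selects the cell in the second row.
second_row = [td.text for td in root.findall(".//h:tr[2]/h:td", ns)]
print(second_row)  # ['Want This']
```

The same idea should carry over to the lxml tree in the question, since lxml's `xpath()` method also accepts a `namespaces=` mapping. Passing `namespaceHTMLElements=False` to `html5lib.parse` is another option that may be worth trying, as it asks html5lib not to namespace the elements in the first place.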

Tags: python parsing xpath lxml html5lib

7 answers

家丑人穷心不美

Reply #2 · 2020-02-23 07:33

Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?

Here is a way to do this with lxml:

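from lxml import html

# text holds the HTML from the question
text = '<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>'
tree = html.fromstring(text)
[td.text for td in tree.xpath("//td")]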

Result:

['Header', 'Want This']
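
Since the question asks for the second row only, the XPath can be narrowed with a positional predicate. This is a minimal sketch along the same lines as the snippet above, assuming the table looks exactly like the one in the question; the variable names `text` and `second_row` are picked here for illustration.

```python
from lxml import html

# The table from the question, as a plain string.
text = (
    "<html><table>"
    "<tr><td>Header</td></tr>"
    "<tr><td>Want This</td></tr>"
    "</table></html>"
)
tree = html.fromstring(text)

# tr[2] is the second <tr> among its siblings (XPath positions start at 1);
# text() returns the cell's text directly instead of the element.
second_row = tree.xpath("//tr[2]/td/text()")
print(second_row)  # ['Want This']
```

If the real page has several tables, `//tr[2]` would match the second row of each one, so the path would likely need to be anchored to the specific table first.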
