How can I parse HTML with html5lib, and query the

Posted 2020-02-23 06:47

I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

So let's try it:

>>> import html5lib, lxml.etree
>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

That looks good; let's see what else we have:

>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

Seriously. I was planning on using some XPath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.

7 Answers
孤傲高冷的网名
#2 · 2020-02-23 07:14

I always recommend trying the lxml library. It's very fast and has many features.

It also supports the html5lib parser if you need that: lxml.html.html5parser

>>> from lxml.html import fromstring, tostring

>>> html = """
... <html>
...     <table>
...         <tr><td>Header</td></tr>
...         <tr><td>Want This</td></tr>
...     </table>
... </html>
... """
>>> doc = fromstring(html)
>>> tr = doc.cssselect('table tr')[1]
>>> print(tostring(tr, encoding='unicode', with_tail=False))
<tr><td>Want This</td></tr>
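
Since the question asks for XPath specifically, the same lookup can also be written as an XPath query on the lxml.html tree. This is a minimal sketch, assuming lxml is installed; the variable name `cells` is only illustrative. It also avoids `cssselect`, which in recent lxml releases relies on the separately installed cssselect package.

```python
from lxml.html import fromstring

html = """
<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>
"""
doc = fromstring(html)

# '//table//tr[2]' matches the second tr among its siblings,
# whether or not the parser placed the rows inside a tbody.
cells = doc.xpath('//table//tr[2]/td/text()')
print(cells)
```

The elements produced by lxml.html carry no namespace, so the query needs no prefixes.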
啃猪蹄的小仙女
#3 · 2020-02-23 07:16

Since html5lib (by default) creates trees that contain (correct) namespace information, you have to specify (the right) namespaces in your queries as well.

Example with an XPath query:

import html5lib
inp='''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>'''
xns = '{http://www.w3.org/1999/xhtml}'
d = html5lib.parse(inp)
s = d.findall('.//{}td'.format(xns))[-1].text
print(s)

Output:

Want This

The same result without XPath:

s = d.find(xns+'body').find(xns+'table').find(xns+'tbody') \
     .findall(xns+'tr')[-1].find(xns+'td').text

Alternatively, you can also tell html5lib to avoid adding any namespace information during parsing:

d = html5lib.parse(inp, namespaceHTMLElements=False)
s = d.findall('.//td')[-1].text
print(s)

Output:

Want This
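
The namespace behaviour described in this answer can also be tried with the standard library alone, without html5lib. The following is a rough sketch that assumes, as input, markup in the same shape as the serialized tree from the question; the prefix `h` and the variable names are arbitrary choices for this example.

```python
import xml.etree.ElementTree as ET

# Namespaced markup in the same shape as the serialized tree in the question.
xhtml = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml">'
    '<html:head/><html:body><html:table><html:tbody>'
    '<html:tr><html:td>Header</html:td></html:tr>'
    '<html:tr><html:td>Want This</html:td></html:tr>'
    '</html:tbody></html:table></html:body></html:html>'
)
root = ET.fromstring(xhtml)

# Map a prefix of our own choosing to the XHTML namespace URI.
ns = {'h': 'http://www.w3.org/1999/xhtml'}

# An unqualified 'td' query finds nothing: every element is namespaced.
unqualified = root.findall('.//td')

# Qualified query: the td inside the second tr.
cell = root.find('.//h:tr[2]/h:td', ns)
second_row_text = cell.text
print(unqualified, second_row_text)
```

The prefix used in the query does not have to match the `html:` prefix in the serialized output; only the namespace URI matters.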
Lonely孤独者°
#4 · 2020-02-23 07:19

What you want to use is the namespaceHTMLElements argument, which for some reason defaults to True.

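import html5lib
import lxml.html

# Parse into an lxml tree without the XHTML namespace on element names.
doc = html5lib.parse('''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>
''', treebuilder='lxml', namespaceHTMLElements=False)

print(lxml.html.tostring(doc))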

It's probably still easier to use lxml.html, however.
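
With the namespace switched off, a plain XPath query should work on the html5lib tree. This is a small sketch, assuming both html5lib and lxml are installed; the variable name `cells` is made up for the example.

```python
import html5lib

markup = ('<html><table><tr><td>Header</td></tr>'
          '<tr><td>Want This</td></tr></table></html>')

# treebuilder='lxml' yields an lxml ElementTree, which supports full XPath.
doc = html5lib.parse(markup, treebuilder='lxml', namespaceHTMLElements=False)

# html5lib inserts a tbody, so the query uses '//' rather than a fixed path.
cells = doc.xpath('//table//tr[2]/td/text()')
print(cells)
```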

我只想做你的唯一
#5 · 2020-02-23 07:22

I believe you can run CSS selector queries on lxml objects, like so:

elements = root.cssselect('div.content')
data = elements[0].text
狗以群分
#6 · 2020-02-23 07:29

With BeautifulSoup (shown here with the current bs4 package), you can do that with

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>', 'html.parser')
>>> soup.find_all('td')[1].string
'Want This'
>>> soup.find_all('tr')[1].td.string
'Want This'

(Obviously that's a really crude example, but ya.)

Emotional °昔
#7 · 2020-02-23 07:32

Try using jQuery, and you can retrieve all elements. Alternatively, you can put an id on your row and pull it out.

1) ... ...

$("td")[1].innerHTML will be what you want

2) ... ...

$("#blah").text() will be what you want
