I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table:
<html>
<table>
<tr><td>Header</td></tr>
<tr><td>Want This</td></tr>
</table>
</html>
So let's try it:
>>> import html5lib
>>> import lxml.etree
>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>
That looks good. Let's see what else we have:
>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>
LOL WUT?
Seriously. I was planning on using some XPath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.
I always recommend trying out the lxml library. It's blazingly fast and has many features. It also supports the html5lib parser if you need that: html5parser
Since html5lib (by default) creates trees that contain (correct) namespace information, you have to specify (the right) namespaces in your queries as well.
Example with an XPath query:
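A minimal sketch of such a namespaced query. To stay runnable without third-party packages, it re-parses the serialized tree from the question with the standard library's xml.etree.ElementTree rather than calling html5lib; the prefix `h` and the variable names are my own choices. With lxml, the equivalent call would be along the lines of `doc.xpath("//h:tr[2]/h:td/text()", namespaces=NS)`.

```python
import xml.etree.ElementTree as ET

# Serialized tree copied from the lxml.etree.tostring() output in the question.
SERIALIZED = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/>'
    '<html:body><html:table><html:tbody>'
    '<html:tr><html:td>Header</html:td></html:tr>'
    '<html:tr><html:td>Want This</html:td></html:tr>'
    '</html:tbody></html:table></html:body></html:html>'
)

# The prefix "h" is arbitrary; only the namespace URI has to match the tree.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(SERIALIZED)

# Second <tr> under its parent, then the <td> inside it.
cell = root.find(".//h:tr[2]/h:td", NS)
second_row_text = cell.text
print(second_row_text)  # Want This
```

Note that the query has to account for the `tbody` element html5lib inserted; `.//` (or `//` in lxml) sidesteps that.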
Output:
The same result without XPath:
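One way to do this without XPath is to walk the tree using fully qualified tag names in Clark notation (`{namespace-uri}tag`). This is a sketch under the same assumption as before: it parses the serialized tree from the question with the standard library's ElementTree, whose `iter()` and `find()` methods lxml elements also provide.

```python
import xml.etree.ElementTree as ET

# Clark notation prefix: "{namespace-uri}" followed by the local tag name.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Serialized tree copied from the lxml.etree.tostring() output in the question.
SERIALIZED = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/>'
    '<html:body><html:table><html:tbody>'
    '<html:tr><html:td>Header</html:td></html:tr>'
    '<html:tr><html:td>Want This</html:td></html:tr>'
    '</html:tbody></html:table></html:body></html:html>'
)

root = ET.fromstring(SERIALIZED)

# Collect every <tr> in document order, then take the second one.
rows = list(root.iter(XHTML + "tr"))
second_cell = rows[1].find(XHTML + "td")
print(second_cell.text)  # Want This
```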
Alternatively, you can also tell html5lib to avoid adding any namespace information during parsing:
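A minimal sketch of that approach, assuming html5lib and lxml are both installed. It passes `namespaceHTMLElements=False` to `html5lib.parse`, so the tags come back as plain `tr` and `td` and the XPath needs no namespace mapping. The variable names are illustrative.

```python
import html5lib

HTML = (
    "<html><table>"
    "<tr><td>Header</td></tr>"
    "<tr><td>Want This</td></tr>"
    "</table></html>"
)

# Build an lxml tree, but without the XHTML namespace on each element.
doc = html5lib.parse(HTML, treebuilder="lxml", namespaceHTMLElements=False)

# html5lib still inserts <tbody>, so use // rather than a fixed path.
cells = doc.xpath("//tr[2]/td/text()")
print(cells)  # ['Want This']
```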
Output:
What you want to use is the namespaceHTMLElements argument, which for some reason defaults to True. It's probably still easier to use lxml.html, however.
I believe you can do a CSS search on lxml objects, like so:
With BeautifulSoup, you can do that with
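Something along these lines, as a sketch that assumes the current `bs4` package and Python's built-in `html.parser` backend (the original answer may have targeted an older BeautifulSoup release):

```python
from bs4 import BeautifulSoup

HTML = (
    "<html><table>"
    "<tr><td>Header</td></tr>"
    "<tr><td>Want This</td></tr>"
    "</table></html>"
)

soup = BeautifulSoup(HTML, "html.parser")

# find_all() returns rows in document order; index 1 is the second row.
rows = soup.find_all("tr")
second_row_text = rows[1].td.get_text()
print(second_row_text)  # Want This
```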
(Obviously that's a really crude example, but ya.)
Try using jQuery; with it you can retrieve all matching elements. Alternatively, you can put an id on your row and pull it out.
1) ... ...
$("td")[1].innerHTML will be what you want
2) ... ...
$("#blah").text() will be what you want