I'm trying to parse a fragment of HTML:
<body><h1>title</h1><img src=""></body>
I use lxml.html.fromstring. And it is driving me insane because it keeps stripping the <body> tag of my fragments:
> lxml.html.fromstring('<html><h1>a</h1></html>').tag
'html'
> lxml.html.fromstring('<div><h1>a</h1></div>').tag
'div'
> lxml.html.fromstring('<body><h1>a</h1></body>').tag
'h1'
I've also tried document_fromstring, fragment_fromstring, clean_html with page_structure=False, etc. Nothing works.
I need to use lxml, since I'm passing the html fragment to PyQuery.
I just want lxml to not mess with my html fragment. Is it possible to do that?
.fragment_fromstring() removes the <html> tag as well; basically, whenever you do not have an HTML document (with a <html> top-level element and/or a doctype), .fromstring() falls back to .fragment_fromstring(), and that method always removes both the <html> and the <body> tags.
The work-around is to tell .fragment_fromstring() to give you a <body> parent tag:
>>> lxml.html.fragment_fromstring('<body><h1>a</h1></body>', create_parent='body')
<Element body at 0x10d06fbf0>
This does not preserve any attributes on the original <body> tag.
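To see the attribute loss concretely, here is a small sketch. The class="foo" attribute is only an illustrative input; as far as I can tell, create_parent builds a fresh parent element, so nothing from the original tag is carried over:

```python
import lxml.html

# Illustrative input: a fragment whose <body> carries an attribute.
fragment = '<body class="foo"><h1>a</h1></body>'

body = lxml.html.fragment_fromstring(fragment, create_parent='body')

print(body.tag)           # the requested parent tag
print(dict(body.attrib))  # attributes of the original <body> are not carried over
```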
Another work-around is to use the .document_fromstring() method, which will wrap your document in a <html> tag, which you then can remove again:
>>> lxml.html.document_fromstring('<body><h1>a</h1></body>')[0]
<Element body at 0x10d06fcb0>
This does preserve attributes on the <body>:
>>> lxml.html.document_fromstring('<body class="foo"><h1>a</h1></body>')[0].attrib
{'class': 'foo'}
Using the .document_fromstring() function on your first example gives:
>>> body = lxml.html.document_fromstring('<body><h1>title</h1><img src=""></body>')[0]
>>> lxml.html.tostring(body)
'<body><h1>title</h1><img src=""></body>'
If you only want to do this when there is no <html> tag, do what lxml.html.fromstring() does and test for a full document:
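import lxml.html

# The underscore-prefixed helpers are private lxml internals.
# Byte strings need the bytes matcher; text needs the unicode one.
htmltest = lxml.html._looks_like_full_html_bytes if isinstance(inputtext, bytes) else lxml.html._looks_like_full_html_unicode
if htmltest(inputtext):
    tree = lxml.html.fromstring(inputtext)
else:
    tree = lxml.html.document_fromstring(inputtext)[0]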
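If relying on lxml's underscore-prefixed helpers feels fragile, the same check can be sketched with a regex of your own. Treat this as a sketch under assumptions rather than lxml's API: parse_keeping_body and the two pattern names are hypothetical, and the patterns encode my assumption that a "full document" is input starting with <html or a doctype. It also looks up <body> by name instead of taking the first child, which is a small deviation from the [0] indexing above.

```python
import re

import lxml.html

# Assumed definition of "full document": starts with <html or <!doctype.
_FULL_DOC_TEXT = re.compile(r'^\s*<(?:html|!doctype)', re.I)
_FULL_DOC_BYTES = re.compile(br'^\s*<(?:html|!doctype)', re.I)


def parse_keeping_body(inputtext):
    """Return <html> for full documents and <body> for anything else."""
    pattern = _FULL_DOC_BYTES if isinstance(inputtext, bytes) else _FULL_DOC_TEXT
    if pattern.match(inputtext):
        return lxml.html.fromstring(inputtext)
    # document_fromstring() wraps the input in <html>; find <body> by name
    # rather than by position.
    return lxml.html.document_fromstring(inputtext).find('body')


tree = parse_keeping_body('<body class="foo"><h1>a</h1></body>')
print(tree.tag, dict(tree.attrib))
```

Note that a fragment with no <body> of its own, such as '<div><h1>a</h1></div>', also comes back wrapped in a <body> with this approach.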