When using BeautifulSoup with html5lib, it adds the html, head, and body tags automatically:
BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>
Is there an option I can set to turn off this behavior?
If you want it to look better, try this:
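The original snippet did not survive here, so what follows is only a minimal sketch of what it most likely looked like, assuming the built-in html.parser backend is the one being suggested. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# html.parser is the parser from Python's standard library; it leaves
# the fragment as-is and does not add <html>, <head>, or <body>.
soup = BeautifulSoup('<h1>FOO</h1>', 'html.parser')

fragment = str(soup)
print(fragment)  # <h1>FOO</h1>
```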
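This parses the HTML with Python's built-in HTML parser. According to the Beautiful Soup docs, unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag, and unlike lxml, it does not add an <html> tag either.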
Alternatively, you could use the html5lib parser and just select the element after <body>:
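A rough sketch of that idea, assuming html5lib is installed. It uses next_element to step to the first node after the opening body tag, and decode_contents() as one way to get everything inside body when the fragment has more than one top-level element. The variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')

# html5lib wrapped the fragment in <html><head></head><body>...</body></html>,
# so the node right after <body> is the original <h1>.
first = soup.body.next_element
print(first)  # <h1>FOO</h1>

# For fragments with several top-level elements, take the inner
# markup of <body> instead of a single node.
inner = soup.body.decode_contents()
print(inner)  # <h1>FOO</h1>
```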
Your only option is to not use html5lib to parse the data. That's a feature of the html5lib library: it fixes HTML that is lacking, such as adding back in missing required elements.
You could remove html and body by specifying soup.body.<tag>:
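For the fragment from the question, the tag in soup.body.<tag> would be h1. A small sketch, assuming html5lib is installed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')

# Attribute access walks down the tree: <body>, then the first <h1> inside it.
heading = soup.body.h1
print(heading)  # <h1>FOO</h1>
```

Note that this only returns the first matching tag, so it fits best when you know which element you are after.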
Also, you could use unwrap to remove body, head, and html:
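The code for this one is missing, so here is a minimal sketch of how unwrap() could be applied, assuming html5lib is installed. unwrap() replaces a tag with its own children, so calling it on each wrapper tag leaves only the original fragment. The loop is just one way to write it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')

# Replace each wrapper tag with its contents. The empty <head> simply
# disappears; <html> and <body> hand their children up to the parent.
for name in ('html', 'head', 'body'):
    tag = soup.find(name)
    if tag is not None:
        tag.unwrap()

unwrapped = str(soup)
print(unwrapped)  # <h1>FOO</h1>
```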
If you load an XML file, diagnose(data) from the bs4.diagnose module will tell you to use lxml-xml, which will not wrap your soup with html+body.
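A short sketch of the lxml-xml route, assuming the lxml package is installed. The sample XML string and its element names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical XML input, only for demonstration.
data = '<root><item>1</item></root>'

# The 'lxml-xml' feature selects lxml's XML parser, which treats the
# input as XML and does not add HTML wrapper tags.
soup = BeautifulSoup(data, 'lxml-xml')

item_text = soup.root.item.get_text()
print(item_text)  # 1
print(soup.find('html'), soup.find('body'))  # None None
```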
Yet another solution: