When using BeautifulSoup with html5lib, it adds the html, head and body tags automatically:
BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>
Is there an option I can set to turn off this behavior?
In [35]: import bs4 as bs
In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>
This parses the HTML with Python's builtin HTML parser.
Quoting the docs:
Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag.
Unlike lxml, it doesn’t even bother to add an <html> tag.
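To see the difference between the parsers side by side, a small sketch along these lines may help. The helper name `render_with_parsers` is made up for illustration, and it skips any parser that is not installed (lxml and html5lib are optional extras), so only the `html.parser` entry is guaranteed to be present.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def render_with_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same fragment with each available parser (hypothetical helper)."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Optional parser is not installed; skip it.
            continue
    return results


# Print one line per parser that could be loaded.
for parser, html in render_with_parsers("<h1>FOO</h1>").items():
    print(f"{parser:12} {html}")
```

The `html.parser` line should show the fragment unchanged, while the other parsers (if installed) show whatever wrappers they add.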
Alternatively, you could use the html5lib parser and just select the element after <body>:
In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
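Selecting the element after <body> only returns the first node, so for fragments with several top-level elements it may be easier to serialize everything inside <body>. A minimal sketch, assuming a made-up helper name `fragment_html` and falling back to the soup itself when the parser did not add a <body>:

```python
from bs4 import BeautifulSoup


def fragment_html(markup, parser="html5lib"):
    """Return the parsed markup without html/head/body wrappers (hypothetical helper)."""
    soup = BeautifulSoup(markup, parser)
    # html5lib and lxml add a <body>; html.parser does not, so fall back to the soup.
    container = soup.body if soup.body is not None else soup
    # decode_contents() serializes the children of a tag, not the tag itself.
    return container.decode_contents()
```

With html5lib installed, `fragment_html('<h1>FOO</h1><p>bar</p>')` should give back both elements without the wrappers; passing `'html.parser'` as the second argument avoids the extra dependency.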
Your only option is to not use html5lib to parse the data.
That's a feature of the html5lib library: it fixes HTML that is lacking, for example by adding back missing required elements.
Yet another solution:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p><p>Hi!</p>', 'lxml')
# content handling example:
# replace Google with StackOverflow
for a in soup.find_all('a'):
    a['href'] = 'http://stackoverflow.com/'
    a.string = 'StackOverflow'
# join the direct children of <body>, leaving out the html/body wrappers
print(''.join(str(child) for child in soup.body.find_all(recursive=False)))
You can skip html and body by accessing soup.body.<tag> directly:
# python3: first child
print(next(soup.body.children))
# if first child's tag is rss
print(soup.body.rss)
You could also use unwrap to remove body, head, and html:
soup.html.body.unwrap()
# check for a direct <head> child of <html> before unwrapping it
if soup.html.find('head', recursive=False):
    soup.html.head.unwrap()
soup.html.unwrap()
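The unwrap approach can be wrapped in a helper that tolerates missing tags. This is only a sketch: `strip_wrappers` is a hypothetical name, and it uses html.parser on a complete document so that it runs without optional parsers.

```python
from bs4 import BeautifulSoup


def strip_wrappers(soup):
    """Unwrap body, head and html in place if present (hypothetical helper)."""
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:
            # unwrap() removes the tag itself but keeps its children in the tree.
            tag.unwrap()
    return soup


doc = "<html><head></head><body><h1>FOO</h1></body></html>"
soup = strip_wrappers(BeautifulSoup(doc, "html.parser"))
print(soup)
```

Note that unwrapping a non-empty <head> leaves its children (such as <title>) in the output, so this fits best when the head is empty.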
If you load an XML file, bs4.diagnose.diagnose(data) will tell you to use lxml-xml, which will not wrap your soup in html and body:
>>> from bs4 import BeautifulSoup as BS
>>> BS('<foo>xxx</foo>', 'lxml-xml')
<?xml version="1.0" encoding="utf-8"?>
<foo>xxx</foo>
If you want the output to look better, try this:
BeautifulSoup([contents you want to analyze]).prettify()
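As a concrete sketch of pretty-printing, assuming html.parser so that no wrapper tags are added: prettify() is called on the parsed soup and returns an indented string. It changes whitespace, so it is better suited to display than to further processing.

```python
from bs4 import BeautifulSoup

# Parse a fragment, then ask the soup for an indented string representation.
soup = BeautifulSoup("<h1>FOO</h1>", "html.parser")
pretty = soup.prettify()
print(pretty)
```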