I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.
For example, something like
<p>
<ul>
<li>Foo
becomes
<p>
<ul>
<li>Foo</li>
</ul>
</p>
Any help would be appreciated :)
Using BeautifulSoup:
from BeautifulSoup import BeautifulSoup
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html)
print soup.prettify()
gets you
<p>
<ul>
<li>
Foo
</li>
</ul>
</p>
As far as I know, you can't stop prettify() from putting the <li> and </li> tags on separate lines from Foo.
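For what it's worth, the newer bs4 package (Beautiful Soup 4) seems to avoid the extra line breaks if you serialize with str() instead of prettify(). Note that bs4 is an assumption here, since the snippet above imports the older BeautifulSoup module. This is only a minimal sketch: close_tags is an illustrative helper name, and "html.parser" is the parser bundled with Python.

```python
from bs4 import BeautifulSoup  # assumes Beautiful Soup 4 (pip install beautifulsoup4)


def close_tags(html):
    """Parse a fragment and serialize it back without prettify()'s added whitespace."""
    # "html.parser" is Python's bundled parser; other parsers may restructure
    # the tree differently (for example by adding <html> and <body> wrappers).
    soup = BeautifulSoup(html, "html.parser")
    return str(soup)


print(close_tags("<p><ul><li>Foo"))
```

With this parser the unclosed tags are closed in reverse order of opening and the <p> is kept around the list. A different parser may well make a different choice about the <p>.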
Using Tidy:
import tidy
html = "<p><ul><li>Foo"
print tidy.parseString(html, show_body_only=True)
gets you
<ul>
<li>Foo</li>
</ul>
Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing
print tidy.parseString(html, show_body_only=True, drop_empty_paras=False)
comes out as
<p></p>
<ul>
<li>Foo</li>
</ul>
Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.
Finally, Tidy can also do indenting:
print tidy.parseString(html, show_body_only=True, indent=True)
becomes
<ul>
<li>Foo
</li>
</ul>
All of these have their ups and downs, but hopefully one of them is close enough.
Run it through Tidy or one of its ported libraries.
Try to code it by hand and you will want to gouge your eyes out.
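To give a feel for what coding it by hand involves, here is a minimal sketch built on Python 3's standard html.parser module. It keeps a stack of open tags and emits the missing closing tags in reverse order. TagCloser and close_open_tags are hypothetical names, and the sketch deliberately ignores HTML's implied-end-tag rules (for instance a new <li> closing the previous one), which is where the real pain starts. Treat it as an illustration, not a replacement for Tidy or a real parser.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit the input while tracking which tags are still open."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # names of currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: copy through, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching opener: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def close_open_tags(html):
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Whatever is still open at the end gets closed innermost-first.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(close_open_tags("<p><ul><li>Foo"))
```

Even this toy version has to make judgment calls (what to do with stray closing tags, void elements, entities), which is a good argument for leaning on an existing library.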
Use html5lib; it works great. Like this:
soup = BeautifulSoup(data, 'html5lib')
I just ran into an HTML page that lxml and pyquery could not handle well; it seems the markup contains errors.
Since Tidy is not easy to install on Windows, I chose BeautifulSoup.
But I found that:
from BeautifulSoup import BeautifulSoup
import lxml.html
soup = BeautifulSoup(page)
h = lxml.html.fromstring(soup.prettify())
behaves the same as h = lxml.html.fromstring(page)
What really solved my problem was soup = BeautifulSoup(page, 'html5lib').
You need to install html5lib first; then you can use it as a parser in BeautifulSoup.
The html5lib parser seems to work much better than the others.
Hope this helps someone.
I tried the method below, but it failed on Python 3:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(page, 'html5lib')
I tried the following instead and it worked:
import bs4  # Beautiful Soup 4; the html5lib package must be installed as well
soup = bs4.BeautifulSoup(html, 'html5lib')
f_html = soup.prettify()
print(f'Formatted html::: {f_html}')