How do I fix wrongly nested / unclosed HTML tags?

Posted 2019-01-09 01:06

Question:

I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.

For example, something like

<p>
  <ul>
    <li>Foo

becomes

<p>
  <ul>
    <li>Foo</li>
  </ul>
</p>

Any help would be appreciated :)

Answer 1:

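Using BeautifulSoup:

# bs4 is BeautifulSoup 4 (pip install beautifulsoup4)
from bs4 import BeautifulSoup
html = "<p><ul><li>Foo"
# html.parser is Python's built-in parser; other parsers may nest the tags differently
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())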

gets you

<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>

As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.

Using Tidy:

# tidy is the uTidylib binding to HTML Tidy
import tidy
html = "<p><ul><li>Foo"
print(tidy.parseString(html, show_body_only=True))

gets you

<ul>
<li>Foo</li>
</ul>

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))

comes out as

<p></p>
<ul>
<li>Foo</li>
</ul>

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indenting:

print(tidy.parseString(html, show_body_only=True, indent=True))

produces

<ul>
  <li>Foo
  </li>
</ul>

All of these have their ups and downs, but hopefully one of them is close enough.



Answer 2:

Run it through Tidy or one of its ported libraries.

Try to code it by hand and you will want to gouge your eyes out.
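For a sense of what the by-hand route involves, here is a minimal sketch using only the standard library's html.parser. The names VOID_TAGS, TagCloser and close_tags are made up for this illustration. It only keeps a stack of open tags and appends the missing end tags; it knows nothing about HTML's nesting rules (for instance, that a list cannot sit inside a paragraph), which is exactly the part that gets painful and that Tidy or html5lib handle for you.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit the input and close whatever is left open (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of output markup
        self.stack = []  # names of the currently open elements

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as written, nothing to close.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching start tag: drop it
        # Close everything opened after the matching start tag, innermost first.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def close_tags(html):
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Anything still open at the end is closed in reverse order of opening.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(close_tags("<p><ul><li>Foo"))  # <p><ul><li>Foo</li></ul></p>
```

This reproduces the output asked for in the question, but only because it trusts the nesting the user wrote. It does no sanitizing of attributes or scripts, so it is not a substitute for a real cleaner on untrusted input.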



Answer 3:

Use html5lib; it works great. Like this:

soup = BeautifulSoup(data, 'html5lib')
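Applied to the input from the question, that could look like the sketch below. It assumes the beautifulsoup4 and html5lib packages are installed; the variable name fragment is just for illustration. Because html5lib builds a complete document around the input, the sketch pulls out only the contents of the body.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4 html5lib

html = "<p><ul><li>Foo"

# html5lib follows the HTML5 parsing rules, so it wraps the fragment in
# a full document (html, head, body) and repairs the nesting as a browser would.
soup = BeautifulSoup(html, "html5lib")

# Keep only what ended up inside <body>, without the wrapper elements.
fragment = soup.body.decode_contents()
print(fragment)  # <p></p><ul><li>Foo</li></ul>
```

Note that the paragraph comes back empty and the list follows it, the same shape Tidy gives with drop_empty_paras=False, because under HTML5 rules a paragraph cannot contain a list.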



Answer 4:

I just ran into some HTML that lxml and pyquery did not handle well; it seems to contain errors. Since Tidy is not easy to install on Windows, I chose BeautifulSoup. But I found that:

from bs4 import BeautifulSoup
import lxml.html
# page holds the raw HTML string
soup = BeautifulSoup(page)
h = lxml.html.fromstring(soup.prettify())

behaves the same as h = lxml.html.fromstring(page).

What really solved my problem was soup = BeautifulSoup(page, 'html5lib').
You need to install html5lib first; then you can use it as a parser in BeautifulSoup. The html5lib parser seems to work much better than the others.

Hope this helps someone.



Answer 5:

I tried the method below, but it failed on Python 3:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(page, 'html5lib')

I then tried the following, which worked:

import bs4
# html holds the raw HTML string
soup = bs4.BeautifulSoup(html, 'html5lib')
f_html = soup.prettify()
print(f'Formatted html::: {f_html}')