Python HTML sanitizer / scrubber / filter [closed]

2019-01-02 17:17发布

站内文章 / Python

17 0

孤独总比滥情好

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm looking for a module that will remove any HTML tags from a string that are not found in a whitelist.

回答1:

Here's a simple solution using BeautifulSoup:

from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):

    soup = BeautifulSoup(value)

    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True

    return soup.renderContents()

If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden.

You might also look into using lxml and Tidy.

回答2:

Use lxml.html.clean! It's VERY easy!

from lxml.html.clean import clean_html
print clean_html(html)

Suppose the following html:

html = '''\
<html>
 <head>
   <script type="text/javascript" src="evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="evil-site"></iframe>
   <form action="evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

The results...

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

You can customize the elements you want to clean and whatnot.

回答3:

The above solutions via Beautiful Soup will not work. You might be able to hack something with Beautiful Soup above and beyond them, because Beautiful Soup provides access to the parse tree. In a while, I think I'll try to solve the problem properly, but it's a week-long project or so, and I don't have a free week soon.

Just to be specific, not only will Beautiful Soup throw exceptions for some parsing errors which the above code doesn't catch; but also, there are plenty of very real XSS vulnerabilities that aren't caught, like:

<<script>script> alert("Haha, I hacked your page."); </</script>script>

Probably the best thing that you can do is instead to strip out the < element as <, to prohibit all HTML, and then use a restricted subset like Markdown to render formatting properly. In particular, you can also go back and re-introduce common bits of HTML with a regex. Here's what the process looks like, roughly:

_lt_     = re.compile('<')
_tc_ = '~(lt)~'   # or whatever, so long as markdown doesn't mangle it.     
_ok_ = re.compile(_tc_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tc_ + 'sqrt>', re.I)     #just to give an example of extending
_endsqrt_ = re.compile(_tc_ + '/sqrt>', re.I) #html syntax with your own elements.
_tcre_ = re.compile(_tc_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)
    text = markdown(text)
    text = _ok_.sub(r'<\1>', text)
    text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
    text = _endsqrt_.sub(r'</span>', text)
    return _tcre_.sub('&lt;', text)

I haven't tested that code yet, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the ok stuff.

回答4:

Here is what i use in my own project. The acceptable_elements/attributes come from feedparser and BeautifulSoup does the work.

from BeautifulSoup import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol', 
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir', 
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html( fragment ):
    while True:
        soup = BeautifulSoup( fragment )
        removed = False        
        for tag in soup.findAll(True): # find all tags
            if tag.name not in acceptable_elements:
                tag.extract() # remove the bad ones
                removed = True
            else: # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = unicode(soup)

        if removed:
            # we removed tags and tricky can could exploit that!
            # we need to reparse the html until it stops changing
            continue # next round

        return fragment

Some small tests to make sure this behaves correctly:

tests = [   #text should work
            ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
            # make sure we cant exploit removal of tags
            ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
            # try the same trick with attributes, gives an Exception
            ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>',  Exception),
             # no tags should be skipped
            ('<script>bad</script><script>bad</script><script>bad</script>', ''),
            # leave valid tags but remove bad attributes
            ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>', '<a href="good" alt="good">1</a>'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)
    except out, e:
        assert isinstance(e, out), "Wrong exception %r" % e

回答5:

Bleach does better with more useful options. It's built on html5lib and ready for production.

Check this documents.

回答6:

I modified Bryan's solution with BeautifulSoup to address the problem raised by Chris Drost. A little crude, but does the job:

from BeautifulSoup import BeautifulSoup, Comment

VALID_TAGS = {'strong': [],
              'em': [],
              'p': [],
              'ol': [],
              'ul': [],
              'li': [],
              'br': [],
              'a': ['href', 'title']
              }

def sanitize_html(value, valid_tags=VALID_TAGS):
    soup = BeautifulSoup(value)
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
    [comment.extract() for comment in comments]
    # Some markup can be crafted to slip through BeautifulSoup's parser, so
    # we run this repeatedly until it generates the same output twice.
    newoutput = soup.renderContents()
    while 1:
        oldoutput = newoutput
        soup = BeautifulSoup(newoutput)
        for tag in soup.findAll(True):
            if tag.name not in valid_tags:
                tag.hidden = True
            else:
                tag.attrs = [(attr, value) for attr, value in tag.attrs if attr in valid_tags[tag.name]]
        newoutput = soup.renderContents()
        if oldoutput == newoutput:
            break
    return newoutput

Edit: Updated to support valid attributes.

回答7:

I use this:

FilterHTML

It's simple and lets you define a well-controlled white-list, scrubs URLs and even matches attribute values against regex or have custom filtering functions per attribute.

If used carefully it could be a safe solution.

回答8:

You could use html5lib, which uses a whitelist to sanitize.

An example:

import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer

def clean_html(buf):
    """Cleans HTML of dangerous tags and content."""
    buf = buf.strip()
    if not buf:
        return buf

    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
            tokenizer=sanitizer.HTMLSanitizer)
    dom_tree = p.parseFragment(buf)

    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)

    s = serializer.htmlserializer.HTMLSerializer(
            omit_optional_tags=False,
            quote_attr_values=True)
    return s.render(stream)

回答9:

I prefer the lxml.html.clean solution, like nosklo points out. Here's to also remove some empty tags:

from lxml import etree
from lxml.html import clean, fromstring, tostring

remove_attrs = ['class']
remove_tags = ['table', 'tr', 'td']
nonempty_tags = ['a', 'p', 'span', 'div']

cleaner = clean.Cleaner(remove_tags=remove_tags)

def squeaky_clean(html):
    clean_html = cleaner.clean_html(html)
    # now remove the useless empty tags
    root = fromstring(clean_html)
    context = etree.iterwalk(root) # just the end tag event
    for action, elem in context:
        clean_text = elem.text and elem.text.strip(' \t\r\n')
        if elem.tag in nonempty_tags and \
        not (len(elem) or clean_text): # no children nor text
            elem.getparent().remove(elem)
            continue
        elem.text = clean_text # if you want
        # and if you also wanna remove some attrs:
        for badattr in remove_attrs:
            if elem.attrib.has_key(badattr):
                del elem.attrib[badattr]
    return tostring(root)

标签：

孤独总比滥情好

女 | 书童

私信

收藏的人(0)

Ta的文章更多文章

0条评论

还没有人评论过~

Python HTML sanitizer / scrubber / filter [closed]

问题:

回答1:

回答2:

回答3:

回答4:

回答5:

回答6:

回答7:

回答8:

回答9:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮