Sanitize/Rewrite HTML on the Client Side

Posted 2019-01-01 07:05

Question:

I need to display external resources loaded via cross-domain requests and make sure to display only "safe" content.

I could use Prototype's String#stripScripts to remove script blocks, but handlers such as onclick or onerror would still be there.

Is there any library which can at least

  • strip script blocks,
  • kill DOM handlers,
  • remove blacklisted tags (e.g. embed or object).

Are there any JavaScript-related links and examples out there?

Answer 1:

Update 2016: There is now a Google Closure package based on the Caja sanitizer.

It has a cleaner API, was rewritten to take into account APIs available on modern browsers, and interacts better with Closure Compiler.


Shameless plug: see caja/plugin/html-sanitizer.js for a client side html sanitizer that has been thoroughly reviewed.

It is whitelist-based, not blacklist-based, but the whitelists are configurable as per CajaWhitelists.


If you want to remove all tags, then do the following:

var tagBody = '(?:[^"\'>]|"[^"]*"|\'[^\']*\')*';

var tagOrComment = new RegExp(
    '<(?:'
    // Comment body.
    + '!--(?:(?:-*[^->])*--+|-?)'
    // Special "raw text" elements whose content should be elided.
    + '|script\\b' + tagBody + '>[\\s\\S]*?</script\\s*'
    + '|style\\b' + tagBody + '>[\\s\\S]*?</style\\s*'
    // Regular name
    + '|/?[a-z]'
    + tagBody
    + ')>',
    'gi');
function removeTags(html) {
  var oldHtml;
  do {
    oldHtml = html;
    html = html.replace(tagOrComment, '');
  } while (html !== oldHtml);
  return html.replace(/</g, '&lt;');
}

People will tell you that you can create an element, and assign innerHTML and then get the innerText or textContent, and then escape entities in that. Do not do that. It is vulnerable to XSS injection since <img src=bogus onerror=alert(1337)> will run the onerror handler even if the node is never attached to the DOM.
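When the goal is only to show untrusted text as text, the entity escaping can be done on the string itself, so no DOM node ever sees the markup. The following is a minimal sketch of that idea, not part of Caja; the names escapeHtml and HTML_ESCAPES are hypothetical.

```javascript
// Map of characters that are significant in HTML text and attribute values.
var HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
};

// Escapes a string so it can be inserted as plain text into HTML.
// Works purely on the string; nothing is parsed by the browser.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, function (ch) {
    return HTML_ESCAPES[ch];
  });
}
```

Because the input is never assigned to innerHTML, the onerror handler in the example above never gets a chance to run.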



Answer 2:

The Google Caja HTML sanitizer can be made "web-ready" by embedding it in a web worker. Any global variables introduced by the sanitizer will be contained within the worker, plus processing takes place in its own thread.

For browsers that do not support Web Workers, we can use an iframe as a separate environment for the sanitizer to work in. Timothy Chien has a polyfill that does just this, using iframes to simulate Web Workers, so that part is done for us.

The Caja project has a wiki page on how to use Caja as a standalone client-side sanitizer:

  • Check out the source, then build by running ant
  • Include html-sanitizer-minified.js or html-css-sanitizer-minified.js in your page
  • Call html_sanitize(...)

The worker script only needs to follow those instructions:

importScripts('html-css-sanitizer-minified.js'); // or 'html-sanitizer-minified.js'

var urlTransformer, nameIdClassTransformer;

// customize if you need to filter URLs and/or ids/names/classes
urlTransformer = nameIdClassTransformer = function(s) { return s; };

// when we receive some HTML
self.onmessage = function(event) {
    // sanitize, then send the result back
    postMessage(html_sanitize(event.data, urlTransformer, nameIdClassTransformer));
};

(A bit more code is needed to get the simworker library working, but it's not important to this discussion.)
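On the page side, something has to post the HTML to the worker and pick up the reply. The sketch below is one possible way to do that, assuming the worker answers messages in the order it receives them (as the worker script above does). createSanitizerClient and the file name sanitizer-worker.js are hypothetical and not part of Caja or simworker.

```javascript
// Wraps a worker-like object (anything with postMessage and an onmessage slot)
// and returns a function sanitize(html, callback).
function createSanitizerClient(worker) {
  var pending = []; // callbacks waiting for a reply, oldest first

  worker.onmessage = function (event) {
    var callback = pending.shift();
    if (callback) {
      callback(event.data); // event.data is the sanitized HTML string
    }
  };

  return function sanitize(html, callback) {
    pending.push(callback);
    worker.postMessage(html);
  };
}

// Browser usage, assuming the worker script was saved as 'sanitizer-worker.js':
//   var sanitize = createSanitizerClient(new Worker('sanitizer-worker.js'));
//   sanitize(untrustedHtml, function (safeHtml) { container.innerHTML = safeHtml; });
```

Matching replies to callbacks by order keeps the worker script unchanged; if replies could ever arrive out of order, each message would need an id instead.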

Demo: https://dl.dropbox.com/u/291406/html-sanitize/demo.html



Answer 3:

Never trust the client. If you're writing a server application, assume that the client will always submit unsanitary, malicious data. It's a rule of thumb that will keep you out of trouble. If you can, I would advise doing all validation and sanitization in server code, which you know (to a reasonable degree) won't be fiddled with. Perhaps you could use a server-side web application as a proxy for your client-side code, which fetches from the third party and sanitizes the content before sending it to the client?

[edit] I'm sorry, I misunderstood the question. However, I stand by my advice. Your users will probably be safer if you sanitize on the server before sending it to them.



Answer 4:

You can't anticipate every possible weird type of malformed markup that some browser somewhere might trip over to escape blacklisting, so don't blacklist. There are many more structures you might need to remove than just script/embed/object and handlers.

Instead attempt to parse the HTML into elements and attributes in a hierarchy, then run all element and attribute names against an as-minimal-as-possible whitelist. Also check any URL attributes you let through against a whitelist (remember there are more dangerous protocols than just javascript:).
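The URL check could be sketched roughly as follows. This is an illustration under assumptions, not a vetted implementation: it relies on the standard URL constructor (available in modern browsers and Node, but not in old IE), and the names ALLOWED_PROTOCOLS and isAllowedUrl, the choice of protocols, and the placeholder base URL are all hypothetical.

```javascript
// Protocols that are let through; everything else (javascript:, data:,
// vbscript:, ...) is rejected because it is not on the list.
var ALLOWED_PROTOCOLS = {
  'http:': true,
  'https:': true,
  'mailto:': true
};

// Returns true only if the URL parses and its protocol is whitelisted.
// Relative URLs are resolved against baseUrl before the check.
function isAllowedUrl(value, baseUrl) {
  var parsed;
  try {
    parsed = new URL(value, baseUrl || 'https://example.invalid/');
  } catch (e) {
    return false; // unparseable input is treated as unsafe
  }
  return ALLOWED_PROTOCOLS[parsed.protocol] === true;
}
```

Letting the URL parser normalise the scheme means mixed-case variants such as JaVaScRiPt: are compared in lower case, rather than relying on a hand-written prefix match.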

If the input is well-formed XHTML the first part of the above is much easier.

As always with HTML sanitisation, if you can find any other way to avoid doing it, do that instead. There are many, many potential holes. If the major webmail services are still finding exploits after this many years, what makes you think you can do better?



Answer 5:

Now that all major browsers support sandboxed iframes, there is a much simpler way that I think can be secure. I'd love it if this answer could be reviewed by people who are more familiar with this kind of security issue.

NOTE: This method definitely will not work in IE 9 and earlier. See this table for browser versions that support sandboxing.

The idea is to create a hidden iframe with JavaScript disabled, paste your untrusted HTML into it, and let it parse it. Then you can walk the DOM tree and copy out the tags and attributes that are considered safe.

The whitelists shown here are just examples. What\'s best to whitelist would depend on the application. If you need a more sophisticated policy than just whitelists of tags and attributes, that can be accommodated by this method, though not by this example code.

var tagWhitelist_ = {
  'A': true,
  'B': true,
  'BODY': true,
  'BR': true,
  'DIV': true,
  'EM': true,
  'HR': true,
  'I': true,
  'IMG': true,
  'P': true,
  'SPAN': true,
  'STRONG': true
};

var attributeWhitelist_ = {
  'href': true,
  'src': true
};

function sanitizeHtml(input) {
  var iframe = document.createElement('iframe');
  if (iframe['sandbox'] === undefined) {
    alert('Your browser does not support sandboxed iframes. Please upgrade to a modern browser.');
    return '';
  }
  // allow-same-origin lets us read the iframe's DOM; scripts stay disabled
  iframe['sandbox'] = 'allow-same-origin';
  iframe.style.display = 'none';
  document.body.appendChild(iframe); // necessary so the iframe contains a document
  iframe.contentDocument.body.innerHTML = input;

  // Recursively copies whitelisted elements and attributes; text nodes are kept,
  // everything else is replaced by an empty fragment.
  function makeSanitizedCopy(node) {
    var newNode;
    if (node.nodeType == Node.TEXT_NODE) {
      newNode = node.cloneNode(true);
    } else if (node.nodeType == Node.ELEMENT_NODE && tagWhitelist_[node.tagName]) {
      newNode = iframe.contentDocument.createElement(node.tagName);
      for (var i = 0; i < node.attributes.length; i++) {
        var attr = node.attributes[i];
        // NOTE: values are copied unchecked, so href/src can still carry javascript: URLs
        if (attributeWhitelist_[attr.name]) {
          newNode.setAttribute(attr.name, attr.value);
        }
      }
      for (i = 0; i < node.childNodes.length; i++) {
        var subCopy = makeSanitizedCopy(node.childNodes[i]);
        newNode.appendChild(subCopy);
      }
    } else {
      newNode = document.createDocumentFragment();
    }
    return newNode;
  }

  var resultElement = makeSanitizedCopy(iframe.contentDocument.body);
  document.body.removeChild(iframe);
  return resultElement.innerHTML;
}

You can try it out here.

Note that I'm disallowing style attributes and tags in this example. If you allowed them, you'd probably want to parse the CSS and make sure it's safe for your purposes.
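As one example of a slightly richer policy than a single global attribute list, attributes can be whitelisted per tag, so that src is accepted on IMG but not on A. This is only a sketch; attributePolicy_ and isAttributeAllowed are hypothetical names, and the tags and attributes listed are placeholders to adapt to the application.

```javascript
// Per-tag attribute whitelist. Tag names are upper case, as in node.tagName
// for HTML documents; attribute names are lower case, as in attr.name.
var attributePolicy_ = {
  'A': { 'href': true, 'title': true },
  'IMG': { 'src': true, 'alt': true }
};

// Returns true only if attrName is explicitly whitelisted for tagName.
function isAttributeAllowed(tagName, attrName) {
  var allowed = attributePolicy_[String(tagName).toUpperCase()];
  if (!allowed) {
    return false; // tags without an entry keep no attributes at all
  }
  return allowed[String(attrName).toLowerCase()] === true;
}
```

In the copying loop of the code above, a check like isAttributeAllowed(node.tagName, attr.name) could take the place of the attributeWhitelist_ lookup.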

I've tested this on several modern browsers (Chrome 40, Firefox 36 Beta, IE 11, Chrome for Android), and on one old one (IE 8) to make sure it bailed before executing any scripts. I'd be interested to know if there are any browsers that have trouble with it, or any edge cases that I'm overlooking.



Answer 6:

So, it's 2016, and I think many of us are using npm modules in our code now. sanitize-html seems like the leading option on npm, though there are others.

Other answers to this question provide great input in how to roll your own, but this is a tricky enough problem that well-tested community solutions are probably the best answer.

Run this on the command line to install: npm install --save sanitize-html

ES5:

var sanitizeHtml = require('sanitize-html');
// ...
var sanitized = sanitizeHtml(htmlInput);

ES6:

import sanitizeHtml from 'sanitize-html';
// ...
let sanitized = sanitizeHtml(htmlInput);



Answer 7:

String.prototype.sanitizeHTML=function (white,black) {
   if (!white) white="b|i|p|br";//allowed tags
   if (!black) black="script|object|embed";//complete remove tags
   var e=new RegExp("(<("+black+")[^>]*>.*</\\2>|(?!<[/]?("+white+")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
   return this.replace(e,"");
}

  • Blacklist: the tag and its content are removed completely.
  • Whitelist: the tags are retained.
  • Other tags are removed, but their content is retained.
  • All attributes of the remaining (whitelisted) tags are removed.



Answer 8:

The Google Caja library suggested above was far too complex to configure and include in my project, a web application (so it runs in the browser). What I resorted to instead, since we already use the CKEditor component, was its built-in HTML sanitizing and whitelisting function, which is far easier to configure. So you can load a CKEditor instance in a hidden iframe and do something like:

CKEDITOR.instances['myCKEInstance'].dataProcessor.toHtml(myHTMLstring)

Granted, if you're not using CKEditor in your project this may be overkill, since the component itself is around half a megabyte (minified), but if you have the sources, maybe you can isolate the code doing the whitelisting (CKEDITOR.htmlParser?) and make it much smaller.

http://docs.ckeditor.com/#!/api

http://docs.ckeditor.com/#!/api/CKEDITOR.htmlDataProcessor



Answer 9:

I recommend cutting frameworks out of your life; it will make things much easier for you in the long term.

cloneNode: Cloning a node copies all of its attributes and their values but does NOT copy event listeners.

https://developer.mozilla.org/en/DOM/Node.cloneNode

The following is not tested, though I have been using TreeWalkers for some time now and they are one of the most undervalued parts of JavaScript. Here is a list of the node types you can crawl; I usually use SHOW_ELEMENT or SHOW_TEXT.

http://www.w3.org/TR/DOM-Level-2-Traversal-Range/traversal.html#Traversal-NodeFilter

function xhtml_cleaner(id)
{
 var e = document.getElementById(id);
 var f = document.createDocumentFragment();
 f.appendChild(e.cloneNode(true));

 var walker = document.createTreeWalker(f,NodeFilter.SHOW_ELEMENT,null,false);
 var doomed = []; // script elements; removed after the walk so the traversal is not cut short

 while (walker.nextNode())
 {
  var c = walker.currentNode;
  if (c.hasAttribute('contentEditable')) {c.removeAttribute('contentEditable');}
  if (c.hasAttribute('style')) {c.removeAttribute('style');}

  if (c.nodeName.toLowerCase()=='script') {doomed.push(c);}
 }

 for (var i = 0; i < doomed.length; i++) {element_del(doomed[i]);}

 alert(new XMLSerializer().serializeToString(f));
 return f;
}


// Accepts either an element id (string) or an element node.
function element_del(element_id)
{
 var element = (typeof element_id == 'string') ? document.getElementById(element_id) : element_id;
 if (element && element.parentNode)
 {
  element.parentNode.removeChild(element);
 }
 else
 {
  alert('Error: the object or element \'' + element_id + '\' was not found and therefore could not be deleted.');
 }
}