Before everyone tells me that I shouldn't do client-side sanitization (I do in fact intend to run it on the client, though it could work in server-side JavaScript as well), let me clarify what I'm trying to do.
I'd like something akin to Google Caja or HTMLPurifier, but for JavaScript: a whitelist-based security approach that processes HTML and CSS (obtained in string form first, not already inserted into the DOM, which would of course not be safe) and selectively filters out unsafe tags and attributes, either ignoring them, optionally including them as escaped text, or reporting them to the application for further processing, ideally with context. It would be cool if it could also reduce any embedded JavaScript to a safe subset, as Google Caja does, but I know that would be asking a lot.
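To make concrete what I mean by whitelist-based filtering, here is a minimal sketch of the policy I have in mind. The tag and attribute lists are purely illustrative (a real whitelist would need far more care, especially around URLs and CSS), and the walker assumes a detached browser DOM element:

```javascript
// Illustrative whitelist only -- NOT a vetted safe list.
const ALLOWED_TAGS = new Set(['p', 'a', 'em', 'strong', 'ul', 'ol', 'li', 'span']);
const ALLOWED_ATTRS = new Set(['href', 'title', 'class']);

function isAllowedTag(tag) {
  return ALLOWED_TAGS.has(tag.toLowerCase());
}

function isAllowedAttr(name, value) {
  if (!ALLOWED_ATTRS.has(name.toLowerCase())) return false;
  // Reject javascript: URLs in href values.
  if (name.toLowerCase() === 'href' && /^\s*javascript:/i.test(value)) return false;
  return true;
}

// Walk a detached DOM tree, stripping disallowed elements and attributes.
// Instead of removing, a real library could escape the node as text or
// report it to the application, as described above.
function sanitize(node) {
  for (const child of Array.from(node.children)) {
    if (!isAllowedTag(child.tagName)) {
      child.remove();
      continue;
    }
    for (const attr of Array.from(child.attributes)) {
      if (!isAllowedAttr(attr.name, attr.value)) child.removeAttribute(attr.name);
    }
    sanitize(child);
  }
  return node;
}
```

Keeping the policy in plain predicate functions like `isAllowedTag`/`isAllowedAttr` would also make it easy for the application to hook in its own reporting.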
My use case is accessing untrusted XML/XHTML data obtained via JSONP (data from MediaWiki wikis before wiki processing, i.e., raw but untrusted XML/HTML input) and letting the user run queries and transformations on that data (XQuery, jQuery, XSLT, etc.). I want to take advantage of HTML5 for offline use, IndexedDB storage, etc., and then let the results be previewed on the same page where the user has viewed the input source and built or imported their queries.
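For the retrieval side, the kind of JSONP fetch I mean looks roughly like this. The endpoint and parameters are from the public MediaWiki action API (`rvprop=content` returns the raw pre-processed wikitext), but the callback plumbing is just a hypothetical sketch:

```javascript
// Hypothetical JSONP fetch of a page's raw (untrusted) source from a
// MediaWiki API; browser-only code.
function fetchRawPage(title, onData) {
  const cbName = 'jsonpCb_' + Math.random().toString(36).slice(2);
  window[cbName] = function (data) {
    delete window[cbName];
    script.remove();
    onData(data); // data.query.pages holds the revisions' raw content
  };
  const script = document.createElement('script');
  script.src = 'https://en.wikipedia.org/w/api.php' +
    '?action=query&prop=revisions&rvprop=content&format=json' +
    '&callback=' + cbName +
    '&titles=' + encodeURIComponent(title);
  document.head.appendChild(script);
}
```

Note that JSONP itself requires trusting the endpoint to return well-formed JSON wrapped in the callback; the sanitization question below is about the payload, not the transport.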
The user can produce whatever output they want, so I won't sanitize their own code; if they want to inject JavaScript into the page, more power to them. But I do want to protect users who need confidence that the code they add safely copies over targeted elements from the untrusted input, while being prevented from copying unsafe input.
This should definitely be doable, but I am wondering if there are any libraries which already do this.
And if I am stuck implementing this on my own (though I'm curious in either case), I'd like proof of whether building a DOM via `innerHTML` or via DOM creation/appending BEFORE insertion into the document is safe in every way. For example, can events be accidentally triggered if I first run `DOMParser` on the raw HTML, or rely on the browser's HTML parsing by assigning it to the `innerHTML` of a non-inserted div? I believe it should be safe, but I'm not sure whether DOM manipulation events could somehow fire before insertion in a way that could be exploited.
Of course, the constructed DOM would need to be sanitized after that point, but I just want to verify I can safely build the DOM object itself for easier traversal and then worry about filtering out unwanted elements, attributes, and attribute values.
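For reference, the two detached-parsing approaches I'm asking about look like this (browser-only code). My understanding, which is exactly what I'd like confirmed, is in the comments:

```javascript
// Parse untrusted markup into a detached document. As far as I know,
// DOMParser never executes <script> elements in the resulting document,
// but inline handler attributes (onerror, onclick, ...) are still present
// and would become live once inserted -- hence sanitize first.
function parseDetached(untrustedHtml) {
  const doc = new DOMParser().parseFromString(untrustedHtml, 'text/html');
  return doc.body; // detached from the page; safe to traverse and filter
}

// Alternative: innerHTML on a div that is never inserted. Creating the div
// from an inert document (createHTMLDocument) rather than the live one
// should also avoid side effects like image fetches during parsing.
function parseViaDiv(untrustedHtml) {
  const inertDoc = document.implementation.createHTMLDocument('');
  const div = inertDoc.createElement('div');
  div.innerHTML = untrustedHtml;
  return div;
}
```

Either way the result would then go through the whitelist filter before any of it is copied into the live document.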
Thanks!