JavaScript-based X/HTML & CSS sanitization

Before everyone tells me that I shouldn't do client-side sanitization (I do in fact intend to do it on a client, though it could work in SSJS as well), let me clarify what I'm trying to do.

I'd like something akin to Google Caja or HTML Purifier, but for JavaScript: a whitelist-based security approach which processes HTML and CSS in string form (not already inserted into the DOM, of course, which would not be safe) and then selectively filters out unsafe tags or attributes. Filtered items would be ignored, optionally included as escaped text, or otherwise reported to the application for further processing, ideally in context. It would be cool if it could reduce any JavaScript to a safe subset as well, as in Google Caja, but I know that would be asking a lot.

My use case is accessing untrusted XML/XHTML data obtained via JSONP (data from MediaWiki wikis before wiki processing, which allows for raw but untrusted XML/HTML input) and letting the user run queries and transformations on that data (XQuery, jQuery, XSLT, etc.). The app takes advantage of HTML5 for offline use, IndexedDB storage, etc., and the results can then be previewed on the same page where the user has viewed the input source and built or imported their queries.

The user can produce whatever output they want, so I won't sanitize what they are doing; if they want to inject JavaScript into the page, more power to them. But I do want to protect users who want confidence that the code they add safely copies over targeted elements from the untrusted input, while preventing them from copying unsafe input.

This should definitely be doable, but I am wondering if there are any libraries which already do this.

And if I am stuck implementing this on my own (though I'm curious in either case), I'd like proof of whether using innerHTML or DOM creation/appending BEFORE insertion into the document is safe in every way. For example, can events be accidentally triggered if I first run DOMParser, or rely on the browser's HTML parsing by using innerHTML to append raw HTML to a non-inserted div? I believe it should be safe, but I'm not sure whether DOM manipulation events could somehow occur before insertion and be exploited.

Of course, the constructed DOM would need to be sanitized after that point, but I just want to verify I can safely build the DOM object itself for easier traversal and then worry about filtering out unwanted elements, attributes, and attribute values.
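
To make the filtering step concrete, here is a rough sketch of the kind of whitelist pass I have in mind. Everything in it is an assumption for illustration: the plain-object node shape ({ tag, attrs, children }) stands in for a real DOM, and POLICY, SAFE_URL, and filterNode are hypothetical names, not part of any existing library. Disallowed elements are dropped together with their subtree, and every rejection is reported back to the caller.

```javascript
// Hypothetical policy: allowed tag names mapped to their allowed attribute names.
const POLICY = {
  p: [],
  b: [],
  i: [],
  a: ['href', 'title']
};

// Assumption for this sketch: only absolute http(s) and mailto URLs are accepted in href.
const SAFE_URL = /^(?:https?:|mailto:)/i;

// node is either a string (text) or { tag, attrs: {name: value}, children: [...] }.
// Returns a filtered copy, or null if the element itself is not allowed.
function filterNode(node, rejected) {
  if (typeof node === 'string') return node; // text is kept; it must still be escaped on output

  const tag = String(node.tag).toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(POLICY, tag)) {
    rejected.push({ kind: 'element', tag });
    return null; // drop the element and its whole subtree
  }

  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs || {})) {
    const key = name.toLowerCase();
    const allowed =
      POLICY[tag].includes(key) &&
      (key !== 'href' || SAFE_URL.test(String(value).trim()));
    if (allowed) attrs[key] = value;
    else rejected.push({ kind: 'attribute', tag, name: key });
  }

  const children = (node.children || [])
    .map((child) => filterNode(child, rejected))
    .filter((child) => child !== null);

  return { tag, attrs, children };
}
```

The rejected array is what would let the application report filtered items in context instead of silently discarding them.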

Thanks!

The purpose of the ESAPI is to provide a simple interface that provides all the security functions a developer is likely to need in a clear, consistent, and easy to use way. The ESAPI architecture is very simple, just a collection of classes that encapsulate the key security operations most applications need.

JavaScript version of OWASP ESAPI: http://code.google.com/p/owasp-esapi-js

Input validation is extremely difficult to do effectively. HTML is easily the worst mashup of code and data of all time: there are many possible places to put code and many different valid encodings. It is particularly difficult because it is not only hierarchical but also embeds many different parsers (XML, HTML, JavaScript, VBScript, CSS, URL, etc.). While input validation is important and should always be performed, it is not a complete solution for injection attacks. It's better to use escaping as your primary defense.

I haven't used HTML Purifier before, but it looks good, and they have certainly put a lot of time and thought into it. Why not use their solution server-side first, then apply any additional rules you'd like after that? I've seen some hacks that write code using nothing but combinations of [ ] ( ). There are hundreds more examples in the XSS (Cross Site Scripting) Cheat Sheet and at The Open Web Application Security Project (OWASP). For some things to watch out for, see the DOM based XSS Prevention Cheat Sheet.
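
As a rough illustration of escaping as a defense, here is a minimal sketch. The name escapeHtml is hypothetical, and the function only covers untrusted text placed in element content or inside a quoted attribute value; URLs, CSS, and script contexts each need their own encoding rules.

```javascript
// Minimal HTML-escaping sketch (escapeHtml is a hypothetical helper name).
// Replaces the five characters that can change parsing state in element
// content or in a quoted attribute value.
function escapeHtml(text) {
  const replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#x27;'
  };
  return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}
```

With this, a payload such as `<img src=x onerror="alert(1)">` is rendered as visible text rather than parsed as a tag.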

HTML Purifier catches this mixed-encoding hack:

<A HREF="h
tt  p://6&#9;6.000146.0x7.147/">XSS</A>

And this DIV background-image exploit, which hides the XSS behind Unicode escapes:

<DIV STYLE="background-image:\0075\0072\006C\0028'\006a\0061\0076\0061\0073\0063\0072\0069\0070\0074\003a\0061\006c\0065\0072\0074\0028.1027\0058.1053\0053\0027\0029'\0029">
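
To see what that style value hides, the CSS escapes can be decoded before inspection. The sketch below is a simplified decoder written for illustration (decodeCssEscapes is a hypothetical name, not a library function): it handles a backslash followed by one to six hex digits plus one optional trailing whitespace character, and a backslash followed by any other non-newline character.

```javascript
// Simplified CSS escape decoder (decodeCssEscapes is a hypothetical helper name).
function decodeCssEscapes(css) {
  return css.replace(
    /\\([0-9a-fA-F]{1,6})(?:\r\n|[ \t\n\r\f])?|\\([^\n\r\f0-9a-fA-F])/g,
    (match, hex, literal) => {
      if (literal !== undefined) return literal; // e.g. \( becomes (
      const codePoint = parseInt(hex, 16);
      // Null, surrogates, and out-of-range values become U+FFFD.
      if (
        codePoint === 0 ||
        codePoint > 0x10ffff ||
        (codePoint >= 0xd800 && codePoint <= 0xdfff)
      ) {
        return '\uFFFD';
      }
      return String.fromCodePoint(codePoint);
    }
  );
}

// Backslashes are doubled here only because this is a JavaScript string literal.
console.log(decodeCssEscapes('\\0075\\0072\\006C\\0028')); // prints "url("
```

Decoding the start of the quoted part the same way yields `javascript:`, which is why a filter has to normalize escapes before matching against a blacklist or whitelist.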

A bit of what you're up against: all 70 possible representations of the character "<" in HTML and JavaScript:

<
%3C
&lt
&lt;
&LT
&LT;
&#60
&#060
&#0060
&#00060
&#000060
&#0000060
&#60;
&#060;
&#0060;
&#00060;
&#000060;
&#0000060;
&#x3c
&#x03c
&#x003c
&#x0003c
&#x00003c
&#x000003c
&#x3c;
&#x03c;
&#x003c;
&#x0003c;
&#x00003c;
&#x000003c;
&#X3c
&#X03c
&#X003c
&#X0003c
&#X00003c
&#X000003c
&#X3c;
&#X03c;
&#X003c;
&#X0003c;
&#X00003c;
&#X000003c;
&#x3C
&#x03C
&#x003C
&#x0003C
&#x00003C
&#x000003C
&#x3C;
&#x03C;
&#x003C;
&#x0003C;
&#x00003C;
&#x000003C;
&#X3C
&#X03C
&#X003C
&#X0003C
&#X00003C
&#X000003C
&#X3C;
&#X03C;
&#X003C;
&#X0003C;
&#X00003C;
&#X000003C;
\x3c
\x3C
\u003c
\u003C
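
The list above follows a regular pattern, so it can be generated rather than memorized. Here is a minimal sketch (lessThanEncodings is a hypothetical helper name) that rebuilds the same 70 strings in the same order, which could serve as test input for a sanitizer:

```javascript
// Rebuilds the 70 representations of "<" listed above
// (lessThanEncodings is a hypothetical helper name).
function lessThanEncodings() {
  // Literal, URL-encoded, and named-entity forms.
  const out = ['<', '%3C', '&lt', '&lt;', '&LT', '&LT;'];
  // Zero to five leading zeros are accepted in numeric character references.
  const pads = ['', '0', '00', '000', '0000', '00000'];

  // Decimal references: first without, then with the trailing semicolon.
  for (const semi of ['', ';']) {
    for (const pad of pads) out.push('&#' + pad + '60' + semi);
  }

  // Hex references: digit case, then x/X, then semicolon, then padding.
  for (const digits of ['3c', '3C']) {
    for (const x of ['x', 'X']) {
      for (const semi of ['', ';']) {
        for (const pad of pads) out.push('&#' + x + pad + digits + semi);
      }
    }
  }

  // JavaScript string escapes, written here as literal backslash sequences.
  out.push('\\x3c', '\\x3C', '\\u003c', '\\u003C');
  return out;
}

console.log(lessThanEncodings().length); // prints 70
```

The count breaks down as 6 literal and named forms, 12 decimal references, 48 hex references, and 4 JavaScript escapes.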