I want to allow users to create tiny templates that I then render in Django with a predefined context. I am assuming the Django rendering is safe (I asked a question about this before), but there is still the risk of cross-site scripting, and I'd like to prevent this. One of the main requirements of these templates is that the user should have some control over the layout of the page, not just its semantics. I see a couple of solutions:
- Allow the user to use HTML, but filter out dangerous tags manually in the final step (things like `<script>` and `<a onclick='..'>`). I'm not so enthusiastic about this option, because I'm afraid I might overlook some tags. Even then, the user could still use absolute positioning on `<div>`s to mess up a thing or two on the rest of the page.
- Use a markup language that produces safe HTML. From what I can see, in most markup languages, I could strip any html, and then process the result. The problem with this is that most markup languages are not very powerful layout-wise. As far as I could see there is no way to center elements in Markdown, not even in ReST. The pro here is that some markup languages are well-documented, and users might already know how to use them.
- Come up with some proprietary markup. The cons I see here are pretty much all implied by the word proprietary.
So, to summarize: is there some safe and easy way to "purify" HTML (preventing XSS), or is there a reasonably ubiquitous markup language that gives some control over layout and styling?
Resources:
- My earlier question about Django templates
- Class names in markdown.
There's the PHP-based HTML Purifier. I have not used it myself yet, but I have heard very good things about it. They promise a lot:
"HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications."
Maybe it's worth a try even though it's not Python based. Update: @Matchu has found a Python based alternative that looks good too.
You'll have a lot of very difficult edge cases, though; just think about Flash embeds. Plus, malicious uses of `position: absolute` are extremely difficult to track down (there's `position: relative`, which could achieve the same effect but is also a completely legitimate layout tool). Maybe take a look at what, for example, eBay allows and doesn't allow? If anybody has the necessary experience to know what's dangerous and what isn't from millions of examples, they do.
Related resources on eBay:
- HTML & JavaScript with examples
- Site Interference (it's unclear, though, what is just forbidden and what gets filtered)
From what I found, they don't seem to publish their internal HTML blacklists, but output an error message if forbidden code is found. (Probably a wise move on their part, but unfortunate for the purposes of this question.)
Seeing Pekka's answer, I tried to quickly Google an HTML Purifier equivalent in Python. Here's what I came up with: Python HTML Sanitizer. At first glance, it looks pretty good to me.
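To make the allowlist idea concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is not the sanitizer linked above, and it is not production-hardened; the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `AllowlistSanitizer`, and `sanitize` are made up for illustration, and the allowlists themselves are assumptions you would tune. The point is that everything not explicitly allowed is dropped, including `style` attributes, so absolute positioning never gets through.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists (assumptions for this sketch); adjust to your needs.
ALLOWED_TAGS = {"p", "div", "span", "b", "i", "em", "strong",
                "ul", "ol", "li", "br", "a"}
ALLOWED_ATTRS = {"a": {"href"}, "p": {"class"},
                 "div": {"class"}, "span": {"class"}}
VOID_TAGS = {"br"}                    # tags that never get a closing tag
DROP_CONTENT = {"script", "style"}    # drop these tags AND their contents
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and anything else unlisted
            value = value or ""
            # Only absolute links with a known-safe scheme survive,
            # which rules out javascript: URLs (and relative URLs too).
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))


def sanitize(html_text):
    """Return html_text with everything outside the allowlists removed."""
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, `sanitize('<p onclick="x()">Hi <script>alert(1)</script><b>there</b></p>')` keeps the paragraph and the bold text but loses the event handler and the script. A sketch like this does not balance unclosed tags or handle every parser quirk, so a maintained sanitizer library is still the safer choice for real use.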
"Use a markup language that produces safe HTML."
Clearly, the only sensible approach.
"The problem with this is that most markup languages are not very powerful layout-wise."
False.
"no way to center elements in ReST."
False.
Centering is a style -- a CSS feature -- not a markup feature.
The way to center is to assign a CSS class to a piece of text. The `.. class::` directive does this.
You can also define your own interpreted text role, if that's necessary for specifying an inline class on a piece of `<span>` markup.
You are overlooking server-side security issues. You need to be very careful that users can't use the template's import or include mechanism to access files they don't have permission to read.
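One way to reduce that risk is to scan the user's template source before it ever reaches the template engine and reject anything that uses a block tag outside a small allowlist. The sketch below is a rough pre-check using only the standard library; the names `ALLOWED_TEMPLATE_TAGS` and `find_disallowed_tags` are hypothetical, and the set of allowed tags is an assumption. It complements, rather than replaces, locking down the template engine's own configuration.

```python
import re

# Hypothetical allowlist of block tags users may write (an assumption).
# Anything else, such as include, extends, or load, is rejected.
ALLOWED_TEMPLATE_TAGS = {"if", "elif", "else", "endif",
                         "for", "empty", "endfor"}

# Captures the first word inside a {% ... %} block tag.
TAG_RE = re.compile(r"\{%\s*(\w+)")


def find_disallowed_tags(template_source):
    """Return a sorted list of tag names used but not on the allowlist."""
    used = set(TAG_RE.findall(template_source))
    return sorted(used - ALLOWED_TEMPLATE_TAGS)
```

A caller would refuse to save or render the template whenever `find_disallowed_tags(source)` returns a non-empty list, and could show that list to the user as the error message.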
The bigger challenge is to prevent the template system from entering infinite loops or unbounded recursion. This is an obvious threat to system performance, and depending on the implementation and deployment setup, the server may never time out. With a finite number of Python threads at your disposal, repeated calls to a misbehaving template could quickly bring your site down.