I am currently working on a project with a PHP frontend. We're pretty concerned about security, because we'll have quite a lot of users and are an attractive target for hackers. Our users can submit HTML-formatted content that is later visible to other users. This is a big problem because it leaves us vulnerable to the whole range of XSS attacks. We're filtering as well as we can, but the variety of attack vectors is pretty big.
So, I'm looking for PHP-based HTML sanitizing/filtering solutions. Commercial solutions are fine (even preferred). Currently we're using a modified HTML Purifier, but we're not satisfied with the results.
What are some good libraries/tools that are capable of filtering malicious parts of HTML?
HTML5 awareness, for example, would be nice to have, since HTML5 will become a security nightmare once it's available "in the wild".
Update:
We're doing an in-depth configuration of HTML Purifier. It looks like the older framework we used before was just not configuring it at all. Now the results look much better.
HTML Purifier project
Personally, I have had very good results with the HTML Purifier project.
It is highly customizable and has a huge code base. The only issue is uploading the files to your server.
Are you sure you don't have a configuration issue with your installation? The purifier should not let through any disallowed HTML tags at all if configured correctly.
From the web site:
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
Tired of using BBCode due to the current landscape of deficient or insecure HTML filters? Have a WYSIWYG editor but never been able to use it? Looking for high-quality, standards-compliant, open-source components for that application you're building? HTML Purifier is for you!
I wrote an article about how to use the HTML purifier library with CodeIgniter here.
Maybe it will help with giving it another try:
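// load the config and override defaults as necessary
// (HTML Purifier 4.x takes a single dotted key; the older two-part form is deprecated)
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML.AllowedElements', 'a,em,blockquote,p,strong,pre,code');
$config->set('HTML.AllowedAttributes', 'a.href,a.title');
$config->set('HTML.TidyLevel', 'light');
// build the purifier and clean the submitted markup ($dirty_html holds the user's input)
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);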
CodeIgniter has an excellent XSS filter. You could rip it out of the system/libraries/Input.php file if you wanted it as a standalone function.
kses works well. You can easily specify which elements to allow and disallow, so making it ‘HTML5-aware’ would just be a matter of setting an array.
WordPress uses it, so I guess it’s pretty safe ;)
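As a rough sketch of what "setting an array" could look like with the standalone kses library: the allowlist maps each permitted element to its permitted attributes. The wrapper name filterWithKses, the include path, and the particular HTML5 elements chosen are assumptions for illustration, not part of kses itself.

```php
<?php
// Allowlist in the shape kses expects: element name => permitted attributes.
$allowedHtml = [
    'a'          => ['href' => 1, 'title' => 1],
    'em'         => [],
    'strong'     => [],
    'p'          => [],
    'blockquote' => [],
    // "HTML5 awareness" is just more entries (hypothetical choices):
    'section'    => [],
    'figure'     => [],
    'figcaption' => [],
];

// Hypothetical wrapper; assumes kses.php is available on the include path.
function filterWithKses(string $dirtyHtml, array $allowedHtml): string
{
    require_once 'kses.php';
    // The third argument restricts URL schemes in attributes such as href.
    return kses($dirtyHtml, $allowedHtml, ['http', 'https', 'mailto']);
}
```

Anything not listed in the array is stripped, so the safest approach is to start small and add elements only when users actually need them.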
I can really recommend kses for HTML filtering. That's actually what WordPress uses. It's free and open source.
I've used this class before and had pretty decent success:
http://www.phpclasses.org/browse/package/2189.html
You can keep your current solution and show the contents in iframes served from a different base URL. Serving the iframe from a different base URL (a different host) stops JavaScript inside the frame from accessing the main page. That is, if your URL is http://www.yoururl.com/thread/500
you can show the content in iframes with URLs such as http://yoururl.com/thread/500/coment/1 and http://yoururl.com/thread/500/coment/2.
Which base URL you can use depends on your DNS/host configuration.
This doesn't fix the problem, it only works around it, but it can be useful until you find something else.
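A minimal sketch of how the page could emit such an iframe, assuming the main page lives on www.yoururl.com and the content host is yoururl.com as in the example above. The function name renderContentFrame is hypothetical. The isolation comes from the browser's same-origin policy, so what matters is that the host differs, not just the path.

```php
<?php
// Builds an iframe tag whose src points at the separate content host.
function renderContentFrame(string $contentHost, int $threadId, int $commentId): string
{
    // URL layout copied from the example above; %d forces integer IDs.
    $src = sprintf('http://%s/thread/%d/coment/%d', $contentHost, $threadId, $commentId);

    // Escape the URL before placing it inside an HTML attribute.
    return sprintf('<iframe src="%s"></iframe>', htmlspecialchars($src, ENT_QUOTES, 'UTF-8'));
}

echo renderContentFrame('yoururl.com', 500, 1), "\n";
```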
HTMLPurifier probably works, but let me just say that the folder structure is over-complicated and pompous. Hundreds of lines of comments, a folder called "test", a license file, read-mes and info files, images, ANOTHER folder for smoketesting (which is downright abusive), extras, configs, benchmarks, and to top it all off, about 10 different CMS compatibility modes, testimonials on their website, full versions, lite versions, and every other programmatic variation imaginable.