I would appreciate an answer to settle a disagreement between me and some co-workers.
We have a typical PHP / LAMP web application.
The only input we want from users is plain text. We do not invite or want users to enter HTML at any point. Form elements are mostly basic input text tags. There might be a few textareas, checkboxes etc.
There is currently no sanitizing of output to pages. All dynamic content, some of which came from user input, is simply echoed to the page. We obviously need to make it safe.
My solution is to use htmlspecialchars on all output at the time it is echoed on the page.
My co-workers' solution is to add HTML Purifier to the database layer. They want to pass all user entered input through HTML Purifier before it is saved to the database. Apparently they've used it like this on other projects but I think that is a misunderstanding of what HTML Purifier is for.
My understanding is that it only makes sense to use HTML Purifier on a site which allows the user to enter HTML. It takes HTML and makes it safer and cleaner based on a whitelist and other rules.
Who's right and who's wrong?
There's also the whole "escape on input or output" issue but I guess that's a debate for another time and place.
Thanks
As a general rule, escaping should be done for context and for use-case.
If what you want to do is output plain text in an HTML context (and you do), then you need to use escaping functionality that ensures you always output plain text in an HTML context. Given basic PHP, that would indeed be htmlspecialchars($yourString, ENT_QUOTES, 'yourEncoding').
If what you want to do is output HTML in an HTML context (you don't), then you would want to sanitise the HTML when you output it to prevent it from doing damage. Here you would call $purifier->purify($yourString) on output.
If you want to store plain text user input in a database (again, you do) by executing SQL statements, then you should either use prepared statements to prevent SQL injection, or an escaping function specific to your DB, such as mysql_real_escape_string($yourString) (removed in PHP 7.0; mysqli_real_escape_string() is the mysqli equivalent).
You should not:
Of those, all are outright harmful, albeit to different degrees. Note that the following assumes the database is your only or canonical storage medium for the data (it also assumes you have SQL injection taken care of in some other way - if you don't, that'll be your primary issue):
<script> tags? Your user can't do that - you'll destroy that part of his message!
Sanitising as HTML when you're outputting data as plain text (without also escaping it) may have confusing, page-breaking results if you don't set your sanitising module to strip all HTML (which you shouldn't, since then you clearly don't want to be outputting HTML).
Did you sanitise for a <div> context, but are putting your data into an inline element? Your user might put a <div> into your inline element, forcing a layout break into your page (how annoying this is depends on your layout), or use it to influence user perception of metadata (for example to make phishing easier), e.g. like this: (Site admin)
Did you sanitise for a <span> context? The user could use other tags to influence user perception of metadata, e.g. like this:
Worst-case scenario: Did you sanitise your HTML with a version of HTML Purifier that later turns out to have a bug that does allow a certain kind of malicious HTML to survive? Now you're outputting untrusted data and putting users that view this data on your web page at risk.
Sanitising as HTML and then escaping for HTML (in that order!) does not have this problem, but it makes the sanitising step unnecessary, so this combination will just cost you performance. (Presumably that's why your colleague wanted to do the sanitising when saving the data, not when displaying it: your use-case, like most, will probably display the data more often than it is submitted, so you would avoid paying the performance hit frequently.)
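To illustrate why escaping alone is enough for plain text, here is a minimal sketch. The helper name e() and the UTF-8 encoding are my assumptions, not something from your code base. Note that the user's text, including the literal <script>, stays fully readable on the page; it is just not interpreted as markup.

```php
<?php
// Hypothetical helper; the name e() and the UTF-8 encoding are assumptions.
function e(string $text): string
{
    // ENT_QUOTES also escapes single quotes, so the result is safe
    // inside quoted attribute values as well as in element content.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Plain text a user might legitimately want to post.
$userInput = 'How do I use <script> tags & "quotes"?';

// Escape at the moment of output, for the HTML context.
echo '<p>' . e($userInput) . '</p>', PHP_EOL;
// prints: <p>How do I use &lt;script&gt; tags &amp; &quot;quotes&quot;?</p>
```

A browser renders that as the exact text the user typed, which is what you want for a plain-text field.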
tl;dr
Sanitising as HTML when you're outputting as plain text is not a good idea.
Escape / sanitise for use-case and context.
In your situation, you want to escape plain text for an HTML context (= use htmlspecialchars()).
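Putting both contexts together, a rough sketch of the whole round trip could look like this. I'm assuming PDO here, with an in-memory SQLite database standing in for your MySQL connection so the snippet is self-contained; the comments table and its columns are made up for the example.

```php
<?php
// Assumption: PDO with in-memory SQLite stands in for the real MySQL database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table, only for this illustration.
$pdo->exec('CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT NOT NULL)');

$userInput = 'I <3 PHP & "LAMP"';

// SQL context: store the text exactly as entered. The bound parameter
// keeps it out of the SQL syntax, so no HTML escaping or sanitising here.
$insert = $pdo->prepare('INSERT INTO comments (body) VALUES (:body)');
$insert->execute([':body' => $userInput]);

// Read it back unchanged.
$stored = $pdo->query('SELECT body FROM comments WHERE id = 1')->fetchColumn();

// HTML context: escape only at the point of output.
echo '<p>' . htmlspecialchars($stored, ENT_QUOTES, 'UTF-8') . '</p>', PHP_EOL;
```

The database keeps the canonical, unmodified text, so you can still output it correctly into some other context later (JSON, CSV, a plain-text email) by applying that context's own escaping.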