Ok, this subject is a hotbed I understand that. I also understand that this situation is dependent on what you are using as code. I have three situations that need to be resolved.
I have a form in where we need to allow people to make comments and statements that use commas, tildes, etc... but still remain safe from attacks.
I have people entering in dates like this: 10/13/11 mm/dd/yy in English, can this be sanitized?
How do I understand how to use htmlspecialchars()
, htmlentities()
and real_escape_string()
correctly? I've read the php.net site and some posts here but this seems to me to be a situation in where it all depends on the person reading the question what the right answer is.
I really can't accept that... there has to be an answer wherein text formats similar to that which I am posting here can be sanitized. I'd like to know if and how it is possible.
Thanks... because it seems to me that when asking this question in other places it tends to annoy... I am learning what I need to know but I think I have hit a plateau in what I can know without an example of what it is meant to do...
Thanks in advance.
It's a very important question and it actually has a simple answer in the form of encodings. The problem you are facing it that you use a lot of languages at the same time. First you are in HTML, then in PHP and a few seconds later in SQL. All these languages have their own syntax rules.
The thing to remember is: a string should at all times be in its proper encoding.
Lets take an example. You have a HTML form and the user enters the following string into it:
I really <3 dogs & cats ;')
Upon pressing the submit button, this string is being send to your PHP script. Lets assume this is done through GET. It gets appended to the URL, which has its own syntax (the & character has special meaning for instance) so we are changing languages. This means the string must be transformed into the proper URL-encoding. In this case the browser does it, but PHP also has an urlencode
function for that.
In the PHP script, the string is stored in $_GET
, encoded as a PHP string. As long as you are coding PHP, this is perfectly fine. But now lets put the string to use in a SQL query. We change languages and syntax rules, therefore the string must be encoded as SQL through the mysql_real_escape_string
function.
At the other end, we might want to display the string back to the users again. We retrieve the string from the database and it is returned to us as a PHP string. When we want to embed it in HTML for output, we're changing languages again so we must encode our string to HTML through the htmlspecialchars
function.
Throughout the way, the string has always been in the proper encoding, which means any character the user can come up with will be dealt with accordingly. Everything should be running smooth and safe.
A thing to avoid (sometimes this is even recommended by the ignorant) is prematurely encoding your string. For instance, you could apply htmlspecialchars
to the string before putting it in the database. This way, when you retrieve the string later from the database you can stick it in the HTML no problem. Sound great? Yeah, really great until you start getting support tickets of people wondering why their PDF receipts are full of & >
junk.
In code:
form.html:
<form action="post.php" method="get">
<textarea name="comment">
I really <3 dogs & cats ;')
</textarea>
<input type="submit"/>
</form>
URL it generates:
http://www.example.org/form.php?comment=I%20really%20%3C3%20dogs%20&%20cats%20;')
post.php:
// Connect to database, etc....
// Place the new comment in the database
$comment = $_GET['comment']; // Comment is encoded as PHP string
// Using $comment in a SQL query, need to encode the string to SQL first!
$query = "INSERT INTO posts SET comment='". mysql_real_escape_string($comment) ."'";
mysql_query($query);
// Get list of comments from the database
$query = "SELECT comment FROM posts";
print '<html><body><h2>Posts</h2>';
print '<table>';
while($post = mysql_fetch_assoc($query)) {
// Going from PHP string to HTML, need to encode!
print '<tr><td>'. htmlspecialchars($post['comment']) .'</td></tr>';
}
print '</table>';
print '</body></html>'
The crucial thing is to understand what each sanitising function available to you is for, and when it should be used. For example, database-escaping functions are designed to make data safe to insert into the database, and should be used as such; but HTML-escaping functions are designed to neutralise malicious HTML code (like JavaScripts) and make it safe to output data for your users to view. Sanitise the right thing at the right time.*
- There are two different basic approaches you can take: you can sanitise HTML when you receive it, or you can store it exactly as you received it and sanitise it only when it is time to output it to the user. Each of these methods has its proponents, but the second one is probably the least prone to problems (with the first one, what do you do if a flaw is discovered in your sanitising procedure and you find you have insufficiently sanitised content stored in your database?)
Dates can be sanitised using a date parsing function. In PHP you might look at strtotime(). Your objective is typically to take a string representation of a date and output either an object representing a date, or another string that represents the same date in a canonical way (that is: in a specific format).
Regarding the sanitization of dates, PHP has some built-in functions that can be helpful. The strtotime() function will convert just about any imaginable date/time format into a Unix timestamp, which can then be passed to the date() function to convert it to whatever formatting you like.
For example:
$date_sql = date( "Y-m-d", strtotime( $_POST["date"] ) );