When receiving user input on forms, I want to detect whether fields like "username" or "address" contain markup that has a special meaning in XML (RSS feeds) or (X)HTML (when displayed).
So which of these is the correct way to detect whether the input entered doesn't contain any special characters in HTML and XML context?
if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)
or
if (htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data)
or
if (preg_match("/[^\p{L}\-.']/u", $text)) // problem: also catches symbols
Have I missed anything else, like byte sequences or other tricky ways to get markup tags around things like "javascript:"? As far as I'm aware, all XSS and CSRF attacks require < or > around the values to get the browser to execute the code (well, at least from Internet Explorer 6 or later anyway) - is this correct?
I am not looking for something to reduce or filter input. I just want to locate dangerous character sequences when used in XML or HTML context. (strip_tags() is horribly unsafe. As the manual says, it doesn't check for malformed HTML.)
Update
I think I need to clarify that there are a lot of people mistaking this question for a question about basic security via "escaping" or "filtering" dangerous characters. This is not that question, and most of the simple answers given wouldn't solve that problem anyway.
Update 2: Example
- User submits input
if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)
- I save it
Now that the data is in my application I do two things with it - 1) display in a format like HTML - or 2) display inside a format element for editing.
The first one is safe in XML and HTML context
<h2><?php print $input; ?></h2>
<xml><item><?php print $input; ?></item></xml>
The second form is more dangerous, but it should still be safe:
<input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">
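As a minimal sketch of why the attribute case is the more dangerous one: the payload below contains neither < nor >, so it passes the bracket check, yet it breaks out of the value attribute unless it is escaped with ENT_QUOTES. The function names are hypothetical, and strpbrk() stands in for the two mb_strpos() calls on the assumption that the input is UTF-8, where the bytes for < and > never occur inside a multibyte sequence.

```php
<?php
// Hypothetical helper mirroring the bracket check from the question:
// returns true when the input contains neither '<' nor '>'.
function passes_bracket_check(string $data): bool
{
    return strpbrk($data, '<>') === false;
}

// Illustrative renderer that prints the value without escaping.
function render_input_raw(string $value): string
{
    return '<input value="' . $value . '">';
}

// Illustrative renderer that escapes the value, including both quote styles.
function render_input_escaped(string $value): string
{
    return '<input value="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '">';
}

// No angle brackets anywhere, but the double quotes close the attribute early.
$payload = '" onfocus="alert(1)" autofocus="';

var_dump(passes_bracket_check($payload));
echo render_input_raw($payload), "\n";
echo render_input_escaped($payload), "\n";
```

In the raw version the browser sees an onfocus attribute; in the escaped version the whole payload stays inside value.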
Update 3: Working Code
You can download the gist I created and run the code as a text or HTML response to see what I'm talking about. This simple check passes the http://ha.ckers.org XSS Cheat Sheet, and I can't find anything that makes it through. (I'm ignoring Internet Explorer 6 and below).
I started another bounty to award someone that can show a problem with this approach or a weakness in its implementation.
Update 4: Ask a DOM
It's the DOM that we want to protect - so why not just ask it? Timur's answer led to this:
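function not_markup($string)
{
    // Collect libxml parse errors internally instead of emitting warnings.
    libxml_use_internal_errors(true);
    if ($xml = simplexml_load_string("<root>$string</root>"))
    {
        // No child elements means the parser found no tags in the input.
        return $xml->children()->count() === 0;
    }
    // Parsing failed or produced a falsy result: do not treat the input as safe.
    return false;
}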
if (not_markup($_POST['title'])) ...
If the reason for the question is XSS prevention, there are several ways to exploit an XSS vulnerability. A great cheatsheet about this is the XSS Cheatsheet at ha.ckers.org.
But, detection is useless in this case. You only need prevention, and the correct use of htmlspecialchars/htmlentities on your text inputs before saving them to your database is faster and better than detecting bad input.
You could use a regular expression if you know the character sets that are allowed. If the username contains a character that isn't allowed, then throw an error:
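A minimal sketch of such a whitelist check follows. The function name, the allowed character set (letters of any script, spaces, hyphens, dots and apostrophes) and the 64-character limit are assumptions chosen for illustration, not requirements from the question.

```php
<?php
// Hypothetical whitelist validator: the allowed set and the length limit
// are illustrative assumptions and should be adapted to the real field.
function is_valid_username(string $username): bool
{
    // \A and \z anchor the whole string, so a trailing newline is rejected.
    // The /u modifier makes \p{L} match Unicode letters; with invalid UTF-8
    // input preg_match() returns false, which also fails the === 1 test.
    return preg_match("/\A[\p{L} \-.']{1,64}\z/u", $username) === 1;
}

function assert_valid_username(string $username): void
{
    if (!is_valid_username($username)) {
        throw new InvalidArgumentException('Username contains characters that are not allowed.');
    }
}
```

Because every character must be in the allowed set, < and > are rejected along with anything else that was not explicitly permitted.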
Test your regular expressions here: http://www.perlfect.com/articles/regextutor.shtml
filter_input + FILTER_SANITIZE_STRING (there are lots of flags you can choose from): http://www.php.net/manual/en/filter.filters.sanitize.php
You can make use of the strip_tags function in PHP. This function will strip HTML and PHP tags from given data.
For example, if $data is the variable that holds your content, you can use it like this:
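The original snippet is not shown here, so the following is only a sketch of the comparison this answer describes; the function name has_no_tags is hypothetical.

```php
<?php
// Hypothetical helper: compares the input with its strip_tags() result.
// If nothing was removed, strip_tags() found no tags in the input.
function has_no_tags(string $data): bool
{
    return strip_tags($data) === $data;
}

var_dump(has_no_tags('John'));
var_dump(has_no_tags('<b>John</b>'));
```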
It will check stripped content against the original content. If both are equal then we can hope there aren't any HTML tags, and it returns true. Otherwise, it returns false as it found some HTML tags.
HTML Purifier does a good job and is very easy to implement. You could also use a Zend Framework filter like Zend_Filter_StripTags.
HTML Purifier doesn't just fix HTML.
The correct way to detect whether string inputs contain HTML tags, or any other markup that has a special meaning in XML or (X)HTML when displayed (other than being an entity) is simply
if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)
You are correct! All XSS and CSRF attacks require < or > around the values to get the browser to execute the code (at least from IE6+).
Considering the output context given, this is sufficient to safely display in a format like HTML:
<h2><?php print $input; ?></h2>
<xml><item><?php print $input; ?></item></xml>
Of course, if we have any entity in the input, like &aacute;, a browser will not output it as &aacute;, but as á, unless we use a function like htmlspecialchars when doing the output. In this case, even the < and > would also be safe.
In the case of using the string input as the value of an attribute, the safety depends on the attribute.
If the attribute is an input value, we must quote it and use a function like htmlspecialchars in order to have the same content back for editing.
<input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">
Again, even the < and > characters would be safe here.
We may conclude that we do not have to do any kind of detection and rejection of the input, if we always use htmlspecialchars to output it and our context always fits the above cases (or equally safe ones).
[And we also have a number of ways to safely store it in the database, preventing SQL exploits.]
What if the user wants his "username" to be &amp; is not an &? It does not contain < nor > ... will we detect and reject it? Will we accept it? How will we display it? (This input gives interesting results in the new bounty!)
Finally, if our context expands, and we use the string input as an anchor href, then our whole approach suddenly changes dramatically. But this scenario is not included in the question.
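To illustrate the display question for a username that already contains an entity, here is a small sketch, assuming the stored input is the literal text `&amp; is not an &`. It compares the default escaping with the double_encode parameter of htmlspecialchars set to false; the variable names are illustrative.

```php
<?php
// Assumed stored input: contains the five literal characters "&amp;".
$username = '&amp; is not an &';

// Default behaviour: every '&' is escaped, so the browser shows the input
// exactly as the user typed it.
$escaped = htmlspecialchars($username, ENT_QUOTES, 'UTF-8');

// double_encode = false: existing entities are left alone, so the browser
// shows "& is not an &", which is not what the user typed.
$notDoubleEncoded = htmlspecialchars($username, ENT_QUOTES, 'UTF-8', false);

echo $escaped, "\n";           // &amp;amp; is not an &amp;
echo $notDoubleEncoded, "\n";  // &amp; is not an &amp;
```

Which of the two is right depends on whether the application wants to treat stored input as plain text or as already-encoded HTML.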
(It is worth mentioning that even when using htmlspecialchars, the output of a string input may differ if the character encodings are different at each step.)
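A short sketch of one such mismatch, assuming the input arrives as ISO-8859-1 while htmlspecialchars is told to expect UTF-8. The byte values are written out explicitly so the example does not depend on the encoding of the source file.

```php
<?php
$latin1 = "\xE1";      // 'á' as a single ISO-8859-1 byte (invalid on its own in UTF-8)
$utf8   = "\xC3\xA1";  // 'á' as a two-byte UTF-8 sequence

// Invalid UTF-8 without ENT_SUBSTITUTE or ENT_IGNORE: an empty string comes back.
$lost = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the invalid byte becomes U+FFFD (the replacement character).
$substituted = htmlspecialchars($latin1, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// When the declared encoding matches the data, the character passes through.
$kept = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

var_dump($lost === '');
var_dump($substituted === "\xEF\xBF\xBD");
var_dump($kept === $utf8);
```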