When receiving user input on forms, I want to detect whether fields like "username" or "address" contain markup that has a special meaning in XML (RSS feeds) or (X)HTML (when displayed).
So which of these is the correct way to detect whether the input entered doesn't contain any special characters in HTML and XML context?
if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)
or
if (htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data)
or
if (preg_match("/[^\p{L}\-.']/u", $text)) // problem: also catches symbols
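To make the difference between the three candidates concrete, here is a minimal sketch that wraps each check in a function and runs them over a few sample values. The function names and sample strings are hypothetical, and `strpos()` stands in for `mb_strpos()` on the assumption that the input is UTF-8, where `<` and `>` are single bytes that never occur inside a multibyte sequence.

```php
<?php
// Hypothetical wrappers around the three checks from the question.
// Each returns true when the input would be flagged as suspicious.

function has_angle_brackets(string $data): bool
{
    // Byte-wise search; equivalent to mb_strpos() for ASCII needles in UTF-8.
    return strpos($data, '<') !== false || strpos($data, '>') !== false;
}

function changed_by_escaping(string $data): bool
{
    // True when htmlspecialchars() had to rewrite at least one character.
    return htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') !== $data;
}

function outside_whitelist(string $data): bool
{
    // True when any character is not a letter, hyphen, dot or apostrophe.
    return preg_match("/[^\p{L}\-.']/u", $data) === 1;
}

// Sample values chosen for illustration only.
foreach (["O'Brien", 'John Smith', 'Tom & Jerry', 'a < b'] as $sample) {
    printf(
        "%-12s brackets=%s escaped=%s whitelist=%s\n",
        $sample,
        var_export(has_angle_brackets($sample), true),
        var_export(changed_by_escaping($sample), true),
        var_export(outside_whitelist($sample), true)
    );
}
```

With these samples, "Tom & Jerry" passes the bracket check but is flagged by the htmlspecialchars() comparison because of the ampersand, and "John Smith" is flagged only by the whitelist regex because of the space.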
Have I missed anything else, like byte sequences or other tricky ways to get markup tags around things like "javascript:"? As far as I'm aware, all XSS and CSRF attacks require < or > around the values to get the browser to execute the code (well, at least from Internet Explorer 6 or later anyway) - is this correct?
I am not looking for something to reduce or filter input. I just want to locate character sequences that are dangerous when used in an XML or HTML context. (strip_tags() is horribly unsafe. As the manual says, it doesn't check for malformed HTML.)
Update
I think I need to clarify this, because a lot of people are mistaking this question for one about basic security via "escaping" or "filtering" dangerous characters. This is not that question, and most of the simple answers given wouldn't solve that problem anyway.
Update 2: Example
- User submits input
if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)
- I save it
Now that the data is in my application I do two things with it: 1) display it in a format like HTML, or 2) display it inside a form element for editing.
The first one is safe in XML and HTML context
<h2><?php print $input; ?></h2>
<xml><item><?php print $input; ?></item></xml>
The second form is more dangerous, but it should still be safe:
<input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">
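As a rough illustration of why the attribute case is the more dangerous one, the sketch below uses a hypothetical input that contains no angle brackets at all, so it passes the bracket check, yet it would close the value attribute if printed unescaped. The sample string and variable names are assumptions made for the example.

```php
<?php
// Hypothetical input: no < or >, but it contains a double quote.
$input = 'x" onfocus="alert(1)';

// The bracket check from the question accepts this value.
$passesBracketCheck = strpos($input, '<') === false
    && strpos($input, '>') === false;

// Printed raw, the quote ends the attribute and starts a new one.
$raw = '<input value="' . $input . '">';

// With ENT_QUOTES the quote becomes &quot; and stays inside the value.
$escaped = '<input value="' . htmlspecialchars($input, ENT_QUOTES, 'UTF-8') . '">';

var_dump($passesBracketCheck);
echo $raw, "\n";
echo $escaped, "\n";
```

So the bracket check alone covers the element-content case, while the attribute case relies on the htmlspecialchars() call with ENT_QUOTES at output time.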
Update 3: Working Code
You can download the gist I created and run the code as a text or HTML response to see what I'm talking about. This simple check passes the http://ha.ckers.org XSS Cheat Sheet, and I can't find anything that makes it through. (I'm ignoring Internet Explorer 6 and below).
I started another bounty to award someone that can show a problem with this approach or a weakness in its implementation.
Update 4: Ask a DOM
It's the DOM that we want to protect - so why not just ask it? Timur's answer led to this:
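function not_markup($string)
{
    // Collect parse errors internally instead of emitting warnings
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string("<root>$string</root>");
    // Compare strictly: an empty element is falsy even when parsing succeeded
    if ($xml === false)
    {
        return false; // not well-formed XML, so treat it as markup
    }
    // Safe only if the wrapper element gained no child elements
    return $xml->children()->count() === 0;
}
if (not_markup($_POST['title'])) ...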
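To see how the DOM-based check behaves on a few inputs, here is a self-contained sketch. The function name `not_markup_checked` and the sample strings are hypothetical; the function follows the same idea as above, with a strict comparison against false and a call to `libxml_clear_errors()` so that collected parse errors do not pile up between calls.

```php
<?php
// Hypothetical variant of the DOM-based check, for experimentation only.
function not_markup_checked(string $string): bool
{
    // Keep libxml parse errors internal instead of emitting warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string("<root>$string</root>");
    // Discard the collected errors so they do not accumulate.
    libxml_clear_errors();

    if ($xml === false) {
        return false; // not well-formed inside the wrapper element
    }
    // Plain text adds no child elements to <root>.
    return $xml->children()->count() === 0;
}

// Sample values chosen for illustration only.
$samples = ['Hello world', '<b>bold</b>', '<script>', 'Tom & Jerry', 'Tom &amp; Jerry'];
foreach ($samples as $sample) {
    echo str_pad($sample, 18), var_export(not_markup_checked($sample), true), "\n";
}
```

Note that a bare ampersand, as in "Tom & Jerry", is rejected as well, because it is not well-formed XML. Whether that strictness is desirable for fields like "address" is a design decision.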
Regex is still the most efficient way of solving your problem. Whatever framework you plan to use, or are advised to use, a custom regex remains the most efficient option. You can test the string with a regex, then remove (or convert) the affected section with a function such as htmlspecialchars().
No need to install any other framework, or use some long-winded application.
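A minimal sketch of that test-then-convert idea might look like the following. The function name `neutralize_markup` and the character class in the pattern are assumptions chosen for illustration, not a vetted list of dangerous characters.

```php
<?php
// Hypothetical helper: test with a regex, convert only when it matches.
function neutralize_markup(string $data): string
{
    // Assumed pattern: the five characters htmlspecialchars() rewrites
    // when ENT_QUOTES is set.
    if (preg_match('/[<>&"\']/', $data) === 1) {
        return htmlspecialchars($data, ENT_QUOTES, 'UTF-8');
    }
    // Nothing matched, so the value is returned unchanged.
    return $data;
}

echo neutralize_markup('<b>Hi</b>'), "\n";
echo neutralize_markup('plain text'), "\n";
```

Since htmlspecialchars() already leaves a clean string untouched, the regex here mainly serves to detect that something was there to convert.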