Let's say I have a string from the user ($input). I can strip tags so that only allowed tags get through. I can convert it to text with htmlspecialchars(). I can even replace all the tags I don't want with text.
function html($input) {
    $input = '<bl>'.htmlspecialchars($input).'</bl>'; // bl is a custom tag that I style (stands for block)
    global $open;
    $open = []; // Stack of currently open tags
    for ($i = 0; $i < strlen($input); $i++) {
        if (!in_array('code', $open) && !in_array('codebl', $open)) { // Only parse outside code blocks
            $input = preg_replace_callback('#^(.{'.$i.'})<(em|i|del|sub|sup|sml|code|kbd|pre|codebl|quote|bl|sbl)>\s*#s', function($match) {
                global $open; // ...then push the new tag onto the stack
                array_push($open, $match[2]);
                return $match[1].'<'.$match[2].'>'; // And re-emit it without trailing whitespace
            }, $input);
            $input = preg_replace_callback('#^(.{'.$i.'})(https?):\/\/([^\s"\(\)<>]+)#', function($m) {
                return $m[1].'<a href="'.$m[2].'://'.$m[3].'" target="_blank">'.$m[3].'</a>';
            }, $input, -1, $num); // Simple linking
            $i += $num * 9;
            $input = preg_replace_callback('#^(.{'.$i.'})\n\n#', function($m) {
                return $m[1].'</bl><bl>';
            }, $input); // A blank line starts a new bl element
        }
        if (end($open)) { // Close the most recently opened tag
            $input = preg_replace_callback('#^(.{'.$i.'})</('.end($open).')>#s', function($match) {
                global $open;
                array_pop($open);
                return trim($match[1]).'</'.$match[2].'>';
            }, $input);
        }
    }
    while ($open) { // Close any tags the user left open
        $input .= '</'.end($open).'>';
        array_pop($open);
    }
    return $input;
}
The problem is that after that, there is no way to get the literal text <i></i> displayed, because the input is automatically parsed into either an actual <i></i> element (if you write <i></i>) or the visible text &lt;i&gt;&lt;/i&gt; (if you write &lt;i&gt;&lt;/i&gt;). I want the user to be able to enter &lt; (or any other HTML entity) and get < back. If I just sent the input straight to the browser unparsed, it would (obviously) be vulnerable to whatever hackers try to inject into my site. So, how can I let the user use any of a predefined set of HTML tags while still letting them use HTML entities?
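The two outcomes described in the question can be reproduced with PHP's built-in strip_tags() and htmlspecialchars(). This is only a small illustration; the variable names are made up for the example.

```php
<?php
// What the user might type in each case (illustrative values).
$typedTag    = '<i></i>';
$typedEntity = '&lt;i&gt;&lt;/i&gt;';

// Allowing the tag keeps it as markup, so the browser renders an (empty) italic element.
echo strip_tags($typedTag, '<i>'), "\n";      // <i></i>

// Escaping the entity form double-encodes "&", so the browser shows the entity text itself.
echo htmlspecialchars($typedEntity), "\n";    // &amp;lt;i&amp;gt;&amp;lt;/i&amp;gt;
```

Neither call yields markup that the browser would display as the literal text <i></i>, which is the gap the question is about.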
This is what I eventually used:
I know it's a bit riskier, but you can first assume the input is good, then filter out whatever is explicitly identified as bad.
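One way to read that suggestion is sketched below: leave the input alone (so entities pass through as typed) and escape only the "<" characters that do not begin an allowed tag. The function name is hypothetical, the tag list is borrowed from the question, and the sketch assumes allowed tags never carry attributes. It is not a complete sanitizer.

```php
<?php
// Hypothetical helper: the name and the tag list are illustrative assumptions.
function escape_disallowed_tags(string $input): string
{
    // Tags allowed through, written without attributes (list taken from the question).
    $allowed = 'em|i|del|sub|sup|sml|code|kbd|pre|codebl|quote|bl|sbl';

    // Escape every "<" that does not begin an exact allowed opening or closing tag.
    // Entities such as &lt; or &amp; are left untouched, so they render as typed.
    return preg_replace('#<(?!/?(?:' . $allowed . ')>)#', '&lt;', $input);
}

echo escape_disallowed_tags('<i>ok</i> &lt;i&gt; <script>alert(1)</script>'), "\n";
// <i>ok</i> &lt;i&gt; &lt;script>alert(1)&lt;/script>
```

Because only exact, attribute-free tags such as <i> survive, something like <i onclick="..."> is escaped as well. The trade-off mentioned in the answer still applies: anything the filter does not explicitly catch is sent to the browser as is.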