Let's say I have a string from the user ($input). I can strip tags so that only allowed tags get through. I can convert everything to text with htmlspecialchars(). I can even replace all the tags I don't want with text.
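For reference, the first two options look roughly like this with PHP's built-in functions. The sample string and the `<em><i>` allow-list are assumptions made up for illustration.

```php
<?php
// Made-up sample input: an allowed tag, a disallowed tag, and a typed entity.
$input = '<em>hi</em> <script>alert(1)</script> &lt;';

// Option 1: strip every tag except an allow-list.
// Note that strip_tags() does not remove attributes from the tags it keeps.
$stripped = strip_tags($input, '<em><i>');

// Option 2: turn everything into text. Entities the user typed get double-encoded.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n"; // <em>hi</em> alert(1) &lt;
echo $escaped, "\n";  // &lt;em&gt;hi&lt;/em&gt; &lt;script&gt;alert(1)&lt;/script&gt; &amp;lt;
```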
    function html($input) {
        $input = '<bl>'.htmlspecialchars($input).'</bl>'; // bl is a custom tag that I style (stands for block)
        global $open;
        $open = []; // Array of open tags
        for ($i = 0; $i < strlen($input); $i++) {
            if (!in_array('code', $open) && !in_array('codebl', $open)) { // If we are parsing
                $input = preg_replace_callback('#^(.{'.$i.'})&lt;(em|i|del|sub|sup|sml|code|kbd|pre|codebl|quote|bl|sbl)&gt;\s*#s', function($match) {
                    global $open; // ...then add new tags to the array
                    array_push($open, $match[2]);
                    return $match[1].'<'.$match[2].'>'; // And replace them
                }, $input);
                $input = preg_replace_callback('#^(.{'.$i.'})(https?):\/\/([^\s"\(\)<>]+)#', function($m) {
                    return $m[1].'<a href="'.$m[2].'://'.$m[3].'" target="_blank">'.$m[3].'</a>';
                }, $input, -1, $num); // Simple linking
                $i += $num * 9;
                $input = preg_replace_callback('#^(.{'.$i.'})\n\n#', function($m) {
                    return $m[1].'</bl><bl>';
                }, $input); // More of this bl element
            }
            if (end($open)) { // Close tags
                $input = preg_replace_callback('#^(.{'.$i.'})&lt;/('.end($open).')&gt;#s', function($match) {
                    global $open;
                    array_pop($open);
                    return trim($match[1]).'</'.$match[2].'>';
                }, $input);
            }
        }
        while ($open) { // Handle unclosed tags
            $input .= '</'.end($open).'>';
            array_pop($open);
        }
        return $input;
    }
The problem is that after that, there is no way to write a literal <i></i>, because it is automatically parsed into either <i></i> markup (if you write <i></i>) or the text &lt;i&gt;&lt;/i&gt; (if you write &lt;i&gt;&lt;/i&gt;). I want the user to be able to enter &lt; (or any other HTML entity) and get < back. If I just sent the input straight to the browser unescaped, it would obviously be vulnerable to whatever the attackers try to inject into my site. So, how can I let the user use any of a predefined set of HTML tags while still letting them use HTML entities?
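To make the conflict concrete, here is a small sketch of how htmlspecialchars() treats an entity the user typed. The fourth parameter, double_encode, is part of the function's real signature; the sample string is made up. Turning double-encoding off keeps typed entities intact, but it also makes them indistinguishable from escaped tags, so on its own it would not fix a parser that later turns &lt;i&gt; back into markup.

```php
<?php
// Made-up sample: typed entities, a bare "&", and a real tag.
$input = '&lt;i&gt; & <b>';

// Default behaviour: the "&" of every typed entity is escaped again.
$double = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// double_encode = false: existing entities are left alone, everything else is escaped.
$single = htmlspecialchars($input, ENT_QUOTES, 'UTF-8', false);

echo $double, "\n"; // &amp;lt;i&amp;gt; &amp; &lt;b&gt;
echo $single, "\n"; // &lt;i&gt; &amp; &lt;b&gt;
```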
This is what I eventually used:
    function html($input) {
        // Fix single "<"s and "&"s (ones that do not start a tag or an entity)
        $input = preg_replace(["#&([^A-Za-z])#", "#<([^A-Za-z/])#", "#&$#", "#<$#"], ['&amp;$1', '&lt;$1', '&amp;', '&lt;'], $input);
        $open = []; // Array of open tags
        $tag = false; // Are we inside a tag name?
        $close = false; // Is the current tag a close tag?
        for ($i = 0; $i < strlen($input); $i++) { // Start the loop
            if ($tag) { // Are we in a tag?
                if (preg_match("/[^a-z]/", $input[$i])) { // The tag has ended
                    if ($close) {
                        $close = false;
                        $sPos = strrpos(substr($input, 0, $i), '<') + 2; // Start position of tag name
                        $tag = substr($input, $sPos, $i - $sPos); // Tag name
                        if (end($open) == $tag) {
                            array_pop($open); // Good, it's a valid XML closing
                        } else {
                            $input = substr($input, 0, $sPos - 2) . '&lt;/' . $tag . substr($input, $i); // BAD! Convert tag to text (open tag will be handled later)
                        }
                    } else {
                        $sPos = strrpos(substr($input, 0, $i), '<') + 1; // Start position of tag name
                        $tag = substr($input, $sPos, $i - $sPos); // Tag name
                        if (in_array($tag, ['em','i','del','sub','sup','sml','code','kbd','pre','codebl','bl','sbl'])) { // Is it an acceptable tag?
                            array_push($open, $tag); // Add it to the array
                            $j = $i + 1;
                            while (isset($input[$j]) && preg_match("/\s/", $input[$j])) { // Get rid of whitespace
                                $j++;
                            }
                            $input = substr($input, 0, $sPos - 1) . '<' . $tag . '>' . substr($input, $j); // Seems legit
                        } else {
                            $input = substr($input, 0, $sPos - 1) . '&lt;' . $tag . substr($input, $i); // BAD! Convert tag to text
                        }
                    }
                    $tag = false;
                }
            } else if (!in_array('code', $open) && !in_array('codebl', $open) && !in_array('pre', $open)) { // Standard parsing of text
                if ($input[$i] == '<') { // Is it a tag?
                    $tag = true;
                    if (($input[$i + 1] ?? '') == '/') { // Is it a close tag?
                        $i++;
                        $close = true;
                    }
                } else if (substr($input, $i, 4) == 'http') { // Link
                    // The "s" flag lets ".{n}" skip over newlines that precede the link
                    if (preg_match('#^.{'.$i.'}(https?):\/\/([^\s"\(\)<>]+)#s', $input, $m)) {
                        $insert = '<a href="'.$m[1].'://'.$m[2].'" target="_blank">'.$m[2].'</a>';
                        $input = substr($input, 0, $i) . $insert . substr($input, $i + strlen($m[1].'://'.$m[2]));
                        $i += strlen($insert) - 1; // The loop's $i++ then lands on the first character after the link
                    }
                } else if ($input[$i] == "\n" && ($input[$i + 1] ?? '') == "\n") { // Insert <bl> tag? (I use this to separate sections of text)
                    $input = substr($input, 0, $i + 1) . '</bl><bl>' . substr($input, $i + 1);
                    $i += 9; // Skip the inserted '</bl><bl>' so it is not parsed as user input
                }
            } else { // We're in a code tag
                if (substr($input, $i + 1, strlen(end($open)) + 3) == '</'.end($open).'>') {
                    array_pop($open);
                    $i += 2;
                } elseif ($input[$i] == '<') {
                    $input = substr($input, 0, $i) . '&lt;' . substr($input, $i + 1);
                    $i += 3; // Code tags have raw text
                } elseif (in_array('code', $open) && $input[$i] == "\n") { // No linebreaks are allowed in inline tags, convert to <codebl>
                    $open[count($open) - 1] = 'codebl';
                    $start = strrpos(substr($input, 0, $i), '<code>'); // Opening tag of the current code element
                    $end = strpos($input, '</code>', $i); // Its closing tag, if there is one
                    if ($end !== false) {
                        $input = substr($input, 0, $end) . '</codebl>' . substr($input, $end + 7);
                    }
                    $input = substr($input, 0, $start) . '<codebl>' . substr($input, $start + 6);
                    $i += 2; // The opening tag grew by two characters
                }
            }
        }
        while ($open) { // Handle open tags
            $input .= '</'.end($open).'>';
            array_pop($open);
        }
        return '<bl>'.$input.'</bl>';
    }
I know it's a bit riskier, but you can first assume the input is good and then filter out whatever is explicitly found to be bad.
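One compact way to read that idea is to leave the input as HTML and escape only the pieces that are not recognised. The sketch below is not the function above: escape_unlisted() is a hypothetical helper, the tag list is an assumption, and it does not balance or close tags, so it only illustrates the escaping step.

```php
<?php
// Hypothetical helper: escape every "&" that does not start an entity and
// every "<" that does not start a bare allowed tag such as <i> or </i>.
function escape_unlisted(string $input, array $allowed): string {
    $tags = implode('|', array_map('preg_quote', $allowed));
    // "&" not followed by a named or numeric entity becomes "&amp;".
    $input = preg_replace('/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)/', '&amp;', $input);
    // "<" not followed by an allowed tag name and ">" becomes "&lt;".
    // Tags carrying attributes never match, so they are escaped as well.
    $input = preg_replace('#<(?!/?(?:' . $tags . ')>)#', '&lt;', $input);
    return $input;
}

echo escape_unlisted('<i>a</i> <script>x</script> 1 < 2 &lt; &', ['em', 'i', 'code']), "\n";
// <i>a</i> &lt;script>x&lt;/script> 1 &lt; 2 &lt; &amp;
```

The ampersands are handled first so that the &lt; produced by the second step is not escaped again. A leftover ">" is harmless in text content, which is why it is not touched here.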