Parse HTML user input

Let's say I have a string from the user ($input). I can strip tags so that only allowed tags get through. I can convert the whole string to text with htmlspecialchars(). I can even replace every tag I don't want with its text equivalent.
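For reference, the first two options can be sketched with PHP's built-in strip_tags() and htmlspecialchars(). The helper names and the short allow-list below are assumptions made for illustration only:

```php
<?php
// Hypothetical helper names; strip_tags() and htmlspecialchars() are PHP built-ins.

// Option 1: drop every tag that is not on the allow-list (tag contents are kept).
function keep_allowed_tags(string $input): string
{
    return strip_tags($input, '<em><i><del>');
}

// Option 2: turn the whole input into literal text, entities included.
function as_plain_text(string $input): string
{
    return htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
}

echo keep_allowed_tags('<i>a</i><script>x</script>'), "\n"; // <i>a</i>x
echo as_plain_text('<i>a</i> &lt;'), "\n";                  // &lt;i&gt;a&lt;/i&gt; &amp;lt;
```

Note that strip_tags() leaves attributes on allowed tags untouched, so on its own it is not enough for untrusted input.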

function html($input) {
    $input = '<bl>'.htmlspecialchars($input).'</bl>'; // bl is a custom tag that I style (stands for block)
    global $open;
    $open = []; //Array of open tags
    for ($i = 0; $i < strlen($input); $i++) {
        if (!in_array('code', $open) && !in_array('codebl', $open)) { //If we are parsing
            $input = preg_replace_callback('#^(.{'.$i.'})&lt;(em|i|del|sub|sup|sml|code|kbd|pre|codebl|quote|bl|sbl)&gt;\s*#s', function($match) {
                global $open; //...then add new tags to the array
                array_push($open,$match[2]);
                return $match[1].'<'.$match[2].'>'; //And replace them
            }, $input);
            $input = preg_replace_callback('#^(.{'.$i.'})(https?):\/\/([^\s"\(\)<>]+)#', function($m) {
                return $m[1].'<a href="'.$m[2].'://'.$m[3].'" target="_blank">'.$m[3].'</a>';
            }, $input, -1, $num); //Simple linking
            $i += $num * 9;
            $input = preg_replace_callback('#^(.{'.$i.'})\n\n#', function($m) {
                return $m[1].'</bl><bl>';
            }, $input); // More of this bl element
        }
        if (end($open)) { //Close tags
            $input = preg_replace_callback('#^(.{'.$i.'})&lt;/('.end($open).')&gt;#s', function($match) {
                global $open;
                array_pop($open);
                return trim($match[1]).'</'.$match[2].'>';
            }, $input);
        }
    }
    while ($open) { //Handle unclosed tags
        $input .= '</'.end($open).'>';
        array_pop($open);
    }
    return $input;
}

The problem is that after that, there is no way to write a literal <i>, because it is automatically parsed into either an actual <i></i> element (if you write <i></i>) or &amp;lt;i&amp;gt;&amp;lt;/i&amp;gt; (if you write &lt;i&gt;&lt;/i&gt;). I want the user to be able to enter &lt; (or any other HTML entity) and get < back. If I just sent the input straight to the browser unparsed, it would obviously be vulnerable to whatever attackers try to inject into my site. So, how can I let the user use any of a predefined set of HTML tags while still letting them use HTML entities?
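One possible direction, as a rough sketch rather than a finished sanitizer: split the raw input on the allowed tags, pass those tags through unchanged, and escape everything else with htmlspecialchars() called with its double_encode argument set to false, so entities the user typed are left alone. The function name html_sketch and the constant ALLOWED_TAGS are made up for this example; the tag names come from the code above. The sketch does not balance tags or create links.

```php
<?php
// Hypothetical allow-list; the names are the tags used in the question's code.
const ALLOWED_TAGS = ['em', 'i', 'del', 'sub', 'sup', 'sml', 'code', 'kbd',
                      'pre', 'codebl', 'quote', 'bl', 'sbl'];

function html_sketch(string $input): string
{
    $names = implode('|', ALLOWED_TAGS); // plain letters only, so no quoting needed
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // text, allowed tag, text, allowed tag, ..., text
    $parts = preg_split('#(</?(?:' . $names . ')>)#', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $index => $part) {
        if ($index % 2 === 1) {
            $out .= $part; // an allowed tag without attributes: keep as markup
        } else {
            // Escape stray <, >, & and quotes; the final `false` (double_encode)
            // keeps entities such as &lt; as the user wrote them.
            $out .= htmlspecialchars($part, ENT_QUOTES, 'UTF-8', false);
        }
    }
    return $out;
}

echo html_sketch('<i>a</i> &lt;i&gt; <script> & b'), "\n";
// <i>a</i> &lt;i&gt; &lt;script&gt; &amp; b
```

Because only exact, attribute-free tags match the pattern, something like <i onclick="..."> is treated as text and escaped.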

Tags: php html validation parsing user-generated-content

1 answer

萌系小妹纸

Reply #2 · 2019-07-28 23:37

This is what I eventually used:

function html($input) {
    $input = preg_replace(["#&([^A-Za-z])#","#<([^A-Za-z/])#","#&$#","#<$#"], ['&amp;$1','&lt;$1','&amp;','&lt;'], $input); //Fix single "<"s and "&"s (the range A-z would also cover [ \ ] ^ _ and `)
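    $open = []; //Array of open tags
    $tag = false; //Are we currently reading a tag name? (must be set before the loop reads it)
    $close = false; //Is the current tag a close tag?
    for ($i = 0; $i < strlen($input); $i++) { //Start the loop (stop before the end so $input[$i] is always a valid offset)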
        if ($tag) { //Are we in a tag?
            if (preg_match("/[^a-z]/", $input[$i])) { //The tag has ended
                if ($close) {
                    $close = false;
                    $sPos = strrpos(substr($input,0,$i), '<') + 2; //start position of tag
                    $tag = substr($input,$sPos,$i-$sPos); //tag name
                    if (end($open) == $tag) {
                        array_pop($open); //Good, it's a valid XML closing
                    } else {
                        $input = substr($input, 0, $sPos-2) . '&lt;/' . $tag . substr($input, $i); //BAD! Convert tag to text (open tag will be handled later)
                    }
                } else {
                    $sPos = strrpos(substr($input,0,$i), '<') + 1; //start position of tag
                    $tag = substr($input,$sPos,$i-$sPos); //tag name
                    if (in_array($tag, ['em','i','del','sub','sup','sml','code','kbd','pre','codebl','bl','sbl'])) { //Is it an acceptable tag?
                        array_push($open, $tag); //Add it to the array
                        $j = $i + 1;
                        while (preg_match("/\s/", $input[$j])) { //Get rid of whitespace
                            $j++;
                        }
                        $input = substr($input, 0, $sPos - 1) . '<' . $tag . '>' . substr($input, $j); //Seems legit
                    } else {
                        $input = substr($input, 0, $sPos - 1) . '&lt;' . $tag . substr($input, $i); //BAD! Convert tag to text
                    }
                }
                $tag = false;
            }
        } else if (!in_array('code', $open) && !in_array('codebl', $open) && !in_array('pre', $open)) { //Standard parsing of text
            if ($input[$i] == '<') { //Is it a tag?
                $tag = true;
                if ($input[$i+1] == '/') { //Is it a close tag?
                    $i++;
                    $close = true;
                }
            } else if (substr($input, $i, 4) == 'http') { //Link
                if (preg_match('#^.{'.$i.'}(https?):\/\/([^\s"\(\)<>]+)#', $input, $m)) {
                    $insert = '<a href="'.$m[1].'://'.$m[2].'" target="_blank">'.$m[2].'</a>';
                    $input = substr($input, 0, $i) . $insert . substr($input, $i + strlen($m[1].'://'.$m[2]));
                    $i += strlen($insert);
                }
            } else if ($input[$i] == "\n" && $input[$i+1] == "\n") { //Insert <bl> tag? (I use this to separate sections of text)
                $input = substr($input, 0, $i + 1) . '</bl><bl>' . substr($input, $i + 1);
            }
        } else { // We're in a code tag
            if (substr($input, $i+1, strlen(end($open)) + 3) == '</'.current($open).'>') {
                array_pop($open);
                $i += 2;
            } elseif ($input[$i] == '<') {
                $input = substr($input, 0, $i) . '&lt;' . substr($input, $i + 1);
                $i += 3; //Code tags have raw text
            } elseif (in_array('code', $open) && $input[$i] == "\n") { //No linebreaks are allowed in inline tags, convert to <codebl>
                $open[count($open) - 1] = 'codebl';
                $input = substr($input, 0, strrpos($input,'<code>')) . '<codebl>' . substr($input, strrpos($input,'<code>') + 6, strpos(substr($input, strrpos($input,'<code>')),'</code>') - 6) . '</codebl>' . substr($input, strpos(substr($input, strrpos($input,'<code>')),'</code>') + strrpos($input,'<code>') + 7);
                $i += 4;
            }
        }
    }
    while ($open) { //Handle open tags
        $input .= '</'.end($open).'>';
        array_pop($open);
    }
    return '<bl>'.$input.'</bl>';
}

I know it's a bit riskier, but you can first assume the input is good and then filter out whatever is explicitly identified as bad.
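The nesting check that the function performs (keep a closing tag only if it matches the most recently opened tag, turn any other closing tag into text, and close whatever is still open at the end) can be isolated into a smaller sketch. The function name balance_tags and its parameters are hypothetical. It assumes the tag names are plain letters, and it leaves ordinary text untouched, so escaping has to happen elsewhere.

```php
<?php
// Hypothetical helper: enforce XML-style nesting for a list of allowed tag names.
function balance_tags(string $html, array $allowed): string
{
    $names = implode('|', $allowed); // assumes plain alphabetic tag names
    // Result alternates: text, tag, text, tag, ..., text
    $parts = preg_split('#(</?(?:' . $names . ')>)#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $open = []; // stack of currently open tag names
    $out = '';
    foreach ($parts as $index => $part) {
        if ($index % 2 === 0) {
            $out .= $part;                       // text between tags, copied as-is
        } elseif ($part[1] !== '/') {
            $open[] = substr($part, 1, -1);      // opening tag: push its name
            $out .= $part;
        } elseif ($open && end($open) === substr($part, 2, -1)) {
            array_pop($open);                    // closes the innermost open tag
            $out .= $part;
        } else {
            $out .= '&lt;' . substr($part, 1, -1) . '&gt;'; // mismatched close: show as text
        }
    }
    while ($open) {
        $out .= '</' . array_pop($open) . '>';   // close anything left open
    }
    return $out;
}

echo balance_tags('<i>a</em>b', ['i', 'em']), "\n"; // <i>a&lt;/em&gt;b</i>
```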
