Allow user submitted HTML in PHP

I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.

Here is my current non-whitelist approach

function FilterHTML($string) {
    if (get_magic_quotes_gpc()) {
        $string = stripslashes($string);
    }
    $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
    // convert decimal
    $string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
    // convert hex
    $string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
    //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
    $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
    $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
    //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
    $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
    $string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
    $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
    $string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
    $string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
    //$string = str_replace('left:0px; top: 0px;','',$string);
    do {
        $oldstring = $string;
        //bgsound|
        $string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
    } while ($oldstring != $string);
    return addslashes($string);
}

The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP? http://refactormycode.com/codes/333-sanitize-html

标签： php whitelist

7条回答

Fickle 薄情

2楼-- · 2019-01-09 11:54

It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.

function sanitize($html) {
  $whitelist = array(
    'b', 'i', 'u', 'strong', 'em', 'a'
  );

  return preg_replace("/<(^".implode("|", $whitelist).")(.*)>(.*)<\/(^".implode("|", $whitelist).")>/", "", $html);
}

I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.

Jamie

0人赞添加讨论(0) 举报

萌系小妹纸

3楼-- · 2019-01-09 11:55

For those of you suggesting simply using strip_tags...be aware: strip_tags will NOT strip out tag attributes and broken tags will also mess it up.

From the manual page:

Warning Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.

Warning This function does not modify any attributes on the tags that you allow using allowable_tags , including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.

You CANNOT rely on just this one solution.

0人赞添加讨论(0) 举报

爷、活的狠高调

4楼-- · 2019-01-09 12:08

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

0人赞添加讨论(0) 举报

虎瘦雄心在

5楼-- · 2019-01-09 12:11

Maybe it is safer to use DOMDocument to analyze it correctly, remove disallowed tags with removeChild() and then get the result. It is not always safe to filter stuff with regular expressions, specially if things start to get such complexity. Hackers can find a way to cheat your filters, forums and social networks do know that very well.

For instance, browsers ignore spaces after the <. Your regex filter <script, but if I use < script... big FAIL!

0人赞添加讨论(0) 举报

不美不萌又怎样

6楼-- · 2019-01-09 12:11

Try this function "getCleanHTML" below, extract text content from the elements with exceptions of elements with tag name in the whitelist. This code is clean and easy to understand and debug.

<?php

$TagWhiteList = array(
    'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);

function getHTMLCode($Node) {
    $Document = new DOMDocument();    
    $Document->appendChild($Document->importNode($Node, true));
    return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
    global $TagWhiteList;

    $TextName = $Node->tagName;
    if ($TextName == null)
        return $Text.$Node->textContent;

    if (in_array($TextName, $TagWhiteList)) 
        return $Text.getHTMLCode($Node);

    $Node = $Node->firstChild;
    if ($Node != null)
        $Text = getCleanHTML($Node, $Text);

    while($Node->nextSibling != null) {
        $Text = getCleanHTML($Node->nextSibling, $Text);
        $Node = $Node->nextSibling;
    }
    return $Text;
}

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";

?>

Hope this helps.

0人赞添加讨论(0) 举报

啃猪蹄的小仙女

7楼-- · 2019-01-09 12:13

HTML Purifier is the best HTML parser/cleaner out there.

0人赞添加讨论(0) 举报

1 2 下一页

Allow user submitted HTML in PHP

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间