How can I remove attributes from an html tag?

2020-01-26 10:08发布

问题:

How can I use php to strip all/any attributes from a tag, say a paragraph tag?

<p class="one" otherrandomattribute="two"> to <p>

回答1:

Although there are better ways, you could actually strip arguments from html tags with a regular expression:

<?php
function stripArgumentFromTags( $htmlString ) {
    $regEx = '/([^<]*<\s*[a-z](?:[0-9]|[a-z]{0,9}))(?:(?:\s*[a-z\-]{2,14}\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)(\s*\/?>[^<]*)/i'; // match any start tag

    $chunks = preg_split($regEx, $htmlString, -1,  PREG_SPLIT_DELIM_CAPTURE);
    $chunkCount = count($chunks);

    $strippedString = '';
    for ($n = 1; $n < $chunkCount; $n++) {
        $strippedString .= $chunks[$n];
    }

    return $strippedString;
}
?>

The above could probably be written in less characters, but it does the job (quick and dirty).



回答2:

Strip attributes using SimpleXML (Standard in PHP5)

<?php

// define allowable tags
$allowable_tags = '<p><a><img><ul><ol><li><table><thead><tbody><tr><th><td>';
// define allowable attributes
$allowable_atts = array('href','src','alt');

// strip collector
$strip_arr = array();

// load XHTML with SimpleXML
$data_sxml = simplexml_load_string('<root>'. $data_str .'</root>', 'SimpleXMLElement', LIBXML_NOERROR | LIBXML_NOXMLDECL);

if ($data_sxml ) {
    // loop all elements with an attribute
    foreach ($data_sxml->xpath('descendant::*[@*]') as $tag) {
        // loop attributes
        foreach ($tag->attributes() as $name=>$value) {
            // check for allowable attributes
            if (!in_array($name, $allowable_atts)) {
                // set attribute value to empty string
                $tag->attributes()->$name = '';
                // collect attribute patterns to be stripped
                $strip_arr[$name] = '/ '. $name .'=""/';
            }
        }
    }
}

// strip unallowed attributes and root tag
$data_str = strip_tags(preg_replace($strip_arr,array(''),$data_sxml->asXML()), $allowable_tags);

?>


回答3:

Here is one function that will let you strip all attributes except ones you want:

function stripAttributes($s, $allowedattr = array()) {
  if (preg_match_all("/<[^>]*\\s([^>]*)\\/*>/msiU", $s, $res, PREG_SET_ORDER)) {
   foreach ($res as $r) {
     $tag = $r[0];
     $attrs = array();
     preg_match_all("/\\s.*=(['\"]).*\\1/msiU", " " . $r[1], $split, PREG_SET_ORDER);
     foreach ($split as $spl) {
      $attrs[] = $spl[0];
     }
     $newattrs = array();
     foreach ($attrs as $a) {
      $tmp = explode("=", $a);
      if (trim($a) != "" && (!isset($tmp[1]) || (trim($tmp[0]) != "" && !in_array(strtolower(trim($tmp[0])), $allowedattr)))) {

      } else {
          $newattrs[] = $a;
      }
     }
     $attrs = implode(" ", $newattrs);
     $rpl = str_replace($r[1], $attrs, $tag);
     $s = str_replace($tag, $rpl, $s);
   }
  }
  return $s;
}

In example it would be:

echo stripAttributes('<p class="one" otherrandomattribute="two">');

or if you eg. want to keep "class" attribute:

echo stripAttributes('<p class="one" otherrandomattribute="two">', array('class'));

Or

Assuming you are to send a message to an inbox and you composed your message with CKEDITOR, you can assign the function as follows and echo it to the $message variable before sending. Note the function with the name stripAttributes() will strip off all html tags that are unnecessary. I tried it and it work fine. i only saw the formatting i added like bold e.t.c.

$message = stripAttributes($_POST['message']);

or you can echo $message; for preview.



回答4:

HTML Purifier is one of the better tools for sanitizing HTML with PHP.



回答5:

I honestly think that the only sane way to do this is to use a tag and attribute whitelist with the HTML Purifier library. Example script here:

<html><body>

<?php

require_once '../includes/htmlpurifier-4.5.0-lite/library/HTMLPurifier/Bootstrap.php';
spl_autoload_register(array('HTMLPurifier_Bootstrap', 'autoload'));

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,b,a[href],i,br,img[src]');
$config->set('URI.Base', 'http://www.example.com');
$config->set('URI.MakeAbsolute', true);

$purifier = new HTMLPurifier($config);

$dirty_html = "
  <a href=\"http://www.google.de\">broken a href link</a
  fnord

  <x>y</z>
  <b>c</p>
  <script>alert(\"foo!\");</script>

  <a href=\"javascript:alert(history.length)\">Anzahl besuchter Seiten</a>
  <img src=\"www.example.com/bla.gif\" />
  <a href=\"http://www.google.de\">missing end tag
 ende 
";

$clean_html = $purifier->purify($dirty_html);

print "<h1>dirty</h1>";
print "<pre>" . htmlentities($dirty_html) . "</pre>";

print "<h1>clean</h1>";
print "<pre>" . htmlentities($clean_html) . "</pre>";

?>

</body></html>

This yields the following clean, standards-conforming HTML fragment:

<a href="http://www.google.de">broken a href link</a>fnord

y
<b>c
<a>Anzahl besuchter Seiten</a>
<img src="http://www.example.com/www.example.com/bla.gif" alt="bla.gif" /><a href="http://www.google.de">missing end tag
ende 
</a></b>

In your case the whitelist would be:

$config->set('HTML.Allowed', 'p');


回答6:

You might also look into html purifier. True, it's quite bloated, and might not fit your needs if it only conceirns this specific example, but it offers more or less 'bulletproof' purification of possible hostile html. Also you can choose to allow or disallow certain attributes (it's highly configurable).

http://htmlpurifier.org/