PHP - HTML Purifier - hello world tutorial

Posted 2019-01-22 08:11

Question:

I am looking into using HTML Purifier to ensure that a user-supplied string (representing a person's name) is sanitized.

I do not want to allow any HTML tags, scripts, markup etc. I just want alphabetic, numeric and normal punctuation characters.

The sheer number of options available for HTML Purifier is daunting and, as far as I can see, the docs do not seem to have a beginning, middle or end.

See: http://htmlpurifier.org/docs

Is there a simple hello world tutorial online for HTML Purifier that shows how to sanitize a string by removing all the bad stuff from it?

I am also considering just using strip_tags():

  • http://php.net/manual/en/function.strip-tags.php

or PHP's built-in data filtering:

  • http://us.php.net/manual/en/book.filter.php

Answer 1:

I've been using HTMLPurifier for sanitizing the output of a rich text editor, and ended up with:

require_once 'htmlpurifier/library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
# Directives use the dotted 'Namespace.Directive' form; the older
# three-argument form is deprecated in HTML Purifier 4.x.
$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'HTML 4.01 Transitional');

if (defined('PURIFIER_CACHE')) {
    $config->set('Cache.SerializerPath', PURIFIER_CACHE);
} else {
    # Disable the cache entirely
    $config->set('Cache.DefinitionImpl', null);
}

# Help out the Purifier a bit, until it develops this functionality:
# repeatedly drop empty <em>/<strong> pairs, keeping any whitespace inside them
while (($cleaner = preg_replace('!<(em|strong)>(\s*)</\1>!', '$2', $input)) != $input) {
    $input = $cleaner;
}

$filter = new HTMLPurifier($config);
$output = $filter->purify($input);

The main points of interest:

  1. Include the autoloader.
  2. Create an instance of HTMLPurifier_Config as $config.
  3. Set configuration settings as needed, with $config->set().
  4. Create an instance of HTMLPurifier, passing $config to it.
  5. Use $filter->purify() on your input.

However, it's entirely overkill for something that doesn't need to allow any HTML in the output.
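
For the plain-text case in the question, the five steps above might be reduced to something like the following. This is only a sketch: it assumes the library is unpacked under htmlpurifier/library/ (the path used above), the function name purifyName is made up for illustration, and it relies on setting the HTML.Allowed directive to an empty string so that no elements are allowed.

```php
<?php
// Assumed install path; adjust to wherever HTMLPurifier.auto.php lives.
require_once 'htmlpurifier/library/HTMLPurifier.auto.php';

// Hypothetical helper, not part of HTML Purifier itself.
function purifyName(string $raw): string
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('Core.Encoding', 'UTF-8');
    $config->set('HTML.Allowed', '');           // allow no elements or attributes
    $config->set('Cache.DefinitionImpl', null); // skip the on-disk cache for this demo

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($raw);
}

echo purifyName('Hello <b>world</b><script>alert(1)</script>');
```

The result is still meant to be embedded in HTML, so characters such as & should come back entity-encoded rather than as raw text.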



Answer 2:

You should validate input based on what it is supposed to contain. For a name, for example, use a regular expression such as

'/^([A-Z][a-z]+[ ]?)+$/' // ASCII only, but easy to extend

The ^ and $ anchors make sure the whole string matches, not just part of it. This validation should do the job well. Then escape the output when printing it on a page, preferably with htmlspecialchars().
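
A minimal sketch of the validation step, assuming names are capitalised ASCII words separated by single spaces. The helper name isValidName is invented for the example, and the pattern is a slightly stricter, anchored form of the one above, so the whole string has to match.

```php
<?php
// Hypothetical helper: true when $name is one or more capitalised
// ASCII words separated by single spaces.
function isValidName(string $name): bool
{
    // The D modifier stops "$" from matching before a trailing newline.
    return preg_match('/^[A-Z][a-z]+(?: [A-Z][a-z]+)*$/D', $name) === 1;
}

var_dump(isValidName('Ada Lovelace'));              // bool(true)
var_dump(isValidName('<script>alert(1)</script>')); // bool(false)
```

Real names contain apostrophes, hyphens and non-ASCII letters, so the character classes would need widening before this is used on real data.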



Answer 3:

You can use something like htmlspecialchars() to preserve the characters the user typed in without the browser interpreting them as markup.
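
As a small illustration, here is one way htmlspecialchars() could be wrapped. The helper name escapeForHtml is an assumption made for the example; the flags and charset are standard arguments of the built-in.

```php
<?php
// Hypothetical helper name; htmlspecialchars() itself is a PHP built-in.
function escapeForHtml(string $text): string
{
    // ENT_QUOTES also converts single quotes, which matters inside attributes.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo escapeForHtml('<b>Tom & "Jerry"</b>'), "\n";
// prints: &lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;
```

The browser then shows the tags as literal text instead of rendering them.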



Answer 4:

I've always thought CodeIgniter's XSS cleaning class was quite good, but more recently I've turned to Kohana.

Have a look at their xss_clean method:

http://github.com/kohana/core/blob/c443c44922ef13421f4a3af5b414e19091bbdce9/classes/kohana/security.php



Answer 5:

HTML Purifier in action. Try entering <?php echo "HELLO";?> as fname and WORLD as lname, then check the output.

<?php
require_once 'htmlpurifier/htmlpurifier/library/HTMLPurifier.auto.php';
?>
<form method="post">
<input type="text" name="fname" placeholder="first name"><br>
<input type="text" name="lname" placeholder="last name"><br>
<input type="submit" name="submit" value="submit">
</form>

<?php
if (isset($_POST['submit'])) {
    // One purifier instance can be reused for every field
    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);

    $fname = $purifier->purify($_POST['fname'] ?? '');
    $lname = $purifier->purify($_POST['lname'] ?? '');

    echo "First name is: " . $fname . "<br>";
    echo "Last name is: " . $lname;
}



Answer 6:

I think the easiest way to remove all non-alphanumeric characters from a string is to use Regex.Replace() (this is .NET, not PHP) as follows:

Regex.Replace(stringToCleanUp, @"\W", "");

\w (lowercase) matches any ‘word’ character; for ASCII input that is [a-zA-Z0-9_], although .NET also counts Unicode letters and digits unless RegexOptions.ECMAScript is set. \W (uppercase) matches any ‘non-word’ character, i.e. anything NOT matched by \w. The code above uses \W and replaces each match with nothing.

Alternatively, if you don’t want to allow the underscore, you can use [^a-zA-Z0-9], like this:

Regex.Replace(stringToCleanUp, "[^a-zA-Z0-9]", "");
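
Since the question is about PHP, the same idea can be sketched with preg_replace(), the closest PHP built-in to Regex.Replace(). The helper name keepAlphanumeric is made up for the example.

```php
<?php
// Hypothetical helper: drops every character that is not an ASCII letter or digit.
function keepAlphanumeric(string $input): string
{
    // preg_replace() returns null on a regex error; fall back to an empty string.
    return preg_replace('/[^a-zA-Z0-9]/', '', $input) ?? '';
}

echo keepAlphanumeric("Mary-Jane O'Neil_3"), "\n"; // prints: MaryJaneONeil3
```

Note that this also removes spaces, hyphens and apostrophes, which the question wants to keep, so a real name field would probably need a wider character class.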



Answer 7:

If you are trying to prevent code injection attacks, just escape the data for the context it is going into, and store and print it as the user entered it.

For example: to avoid SQL injection problems in MySQL, use mysqli_real_escape_string() or similar when building the SQL statement (the older mysql_real_escape_string() was removed in PHP 7). *

Another example: when writing data to an HTML document, pass it through htmlentities() so that it appears exactly as the user entered it.

  • Check: http://www.php.net/manual/en/security.database.sql-injection.php
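
A sketch of the "store as entered, escape on output" idea. Instead of an escaping function it uses a PDO prepared statement, which is a different technique with the same goal of keeping user data out of the SQL text. The in-memory SQLite database and the table and column names are assumptions chosen only to keep the example self-contained.

```php
<?php
// In-memory SQLite keeps the sketch self-contained; table and column
// names are invented for the example.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE people (name TEXT)');

$name = "Robert'); DROP TABLE people;--";

// The value is bound as data, so it never becomes part of the SQL text.
$stmt = $pdo->prepare('INSERT INTO people (name) VALUES (:name)');
$stmt->execute([':name' => $name]);

// The stored value is exactly what the user typed.
$stored = $pdo->query('SELECT name FROM people')->fetchColumn();

// Escape only at the point of output into HTML.
echo htmlentities($stored, ENT_QUOTES, 'UTF-8'), "\n";
```

With MySQL the same pattern should apply by changing the DSN passed to PDO.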


Answer 8:

For simplicity, you can either use strip_tags(), or replace occurrences of <, >, and & with &lt;, &gt;, and &amp;, respectively. This definitely isn't the best solution, but it is the quickest.
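
The two quick options can be compared side by side. This is a sketch with a made-up input string; strtr() with an array is used for the manual replacement because it works in a single pass, which avoids re-encoding the & characters it has just inserted.

```php
<?php
$input = '<b>Ann</b> & <i>Bob</i>';

// Option 1: remove the tags, keep the text between them.
$stripped = strip_tags($input);

// Option 2: keep everything but make it inert. strtr() with an array
// replaces in a single pass, so the "&" it inserts is not re-encoded.
$escaped = strtr($input, ['&' => '&amp;', '<' => '&lt;', '>' => '&gt;']);

echo $stripped, "\n"; // prints: Ann & Bob
echo $escaped, "\n";  // prints: &lt;b&gt;Ann&lt;/b&gt; &amp; &lt;i&gt;Bob&lt;/i&gt;
```

strip_tags() leaves the bare & untouched, so its result still needs escaping before it is written into a page.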



Answer 9:

I usually clean all user input before sending it to my database with the following:

mysql_real_escape_string(htmlentities(strip_tags($str)));


Answer 10:

Found this a week ago... LOVE it.

"A simple PHP HTML DOM parser written in PHP5+, supports invalid HTML, and provides a very easy way to handle HTML elements." http://simplehtmldom.sourceforge.net/

// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag;       // Outputs: div
echo $e->outertext; // Outputs: <div>foo <b>bar</b></div>
echo $e->innertext; // Outputs: foo <b>bar</b>
echo $e->plaintext; // Outputs: foo bar

You can also loop through and remove individual tags, etc. The docs and examples are pretty good... I found it easy to use in quite a few places. :-)