PHP: How to remove all non printable characters in

2019-01-01 07:59发布

I imagine I need to remove chars 0-31 and 127,

Is there a function or piece of code to do this efficiently.

标签: php utf-8 ascii
16条回答
裙下三千臣
2楼-- · 2019-01-01 08:20

Starting with PHP 5.2, we also have access to filter_var, which I have not seen any mention of so thought I'd throw it out there. To use filter_var to strip non-printable characters < 32 and > 127, you can do:

Filter ASCII characters below 32

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);

Filter ASCII characters above 127

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

Strip both:

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW|FILTER_FLAG_STRIP_HIGH);

You can also html-encode low characters (newline, tab, etc.) while stripping high:

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW|FILTER_FLAG_STRIP_HIGH);

There are also options for stripping HTML, sanitizing e-mails and URLs, etc. So, lots of options for sanitization (strip out data) and even validation (return false if not valid rather than silently stripping).

Sanitization: http://php.net/manual/en/filter.filters.sanitize.php

Validation: http://php.net/manual/en/filter.filters.validate.php

However, there is still the problem, that the FILTER_FLAG_STRIP_LOW will strip out newline and carriage returns, which for a textarea are completely valid characters...so some of the Regex answers, I guess, are still necessary at times, e.g. after reviewing this thread, I plan to do this for textareas:

$string = preg_replace( '/[^[:print:]\r\n]/', '',$input);

This seems more readable than a number of the regexes that stripped out by numeric range.

查看更多
时光乱了年华
3楼-- · 2019-01-01 08:20

This worked for me. I had to convert a string of any kind which was a random title in to a slug for SEO.

function string2Slug($str){

    $str = trim($str);
    $str = str_replace(" ","_",$str);
    $temp = explode("\\u",$str);
    $str = '';
    foreach ($temp as $bit) {
        $str .= substr($bit,4);
    }

    $str = str_replace("'","",$str);
    $str = str_replace("\"","",$str);
    $str = str_replace("\\","",$str);
    $str = str_replace("\/","",$str);
    $str = str_replace("/","",$str);
    $str = str_replace("?","",$str);
    $str = str_replace("#","",$str);
    $str = str_replace("&","",$str);
    $str = str_replace("%","",$str);
    $str = str_replace("!","",$str);

    return $str;

}
查看更多
刘海飞了
4楼-- · 2019-01-01 08:25

how about:

return preg_replace("/[^a-zA-Z0-9`_.,;@#%~'\"\+\*\?\[\^\]\$\(\)\{\}\=\!\<\>\|\:\-\s\\\\]+/", "", $data);

gives me complete control of what I want to include

查看更多
查无此人
5楼-- · 2019-01-01 08:26

You could use a regular express to remove everything apart from those characters you wish to keep:

$string=preg_replace('/[^A-Za-z0-9 _\-\+\&]/','',$string);

Replaces everything that is not (^) the letters A-Z or a-z, the numbers 0-9, space, underscore, hypen, plus and ampersand - with nothing (i.e. remove it).

查看更多
零度萤火
6楼-- · 2019-01-01 08:30

For anyone that is still looking how to do this without removing the non-printable characters, but rather escaping them, I made this to help out. Feel free to improve it! Characters are escaped to \\x[A-F0-9][A-F0-9].

Call like so:

$escaped = EscapeNonASCII($string);

$unescaped = UnescapeNonASCII($string);

<?php 
  function EscapeNonASCII($string) //Convert string to hex, replace non-printable chars with escaped hex
    {
        $hexbytes = strtoupper(bin2hex($string));
        $i = 0;
        while ($i < strlen($hexbytes))
        {
            $hexpair = substr($hexbytes, $i, 2);
            $decimal = hexdec($hexpair);
            if ($decimal < 32 || $decimal > 126)
            {
                $top = substr($hexbytes, 0, $i);
                $escaped = EscapeHex($hexpair);
                $bottom = substr($hexbytes, $i + 2);
                $hexbytes = $top . $escaped . $bottom;
                $i += 8;
            }
            $i += 2;
        }
        $string = hex2bin($hexbytes);
        return $string;
    }
    function EscapeHex($string) //Helper function for EscapeNonASCII()
    {
        $x = "5C5C78"; //\x
        $topnibble = bin2hex($string[0]); //Convert top nibble to hex
        $bottomnibble = bin2hex($string[1]); //Convert bottom nibble to hex
        $escaped = $x . $topnibble . $bottomnibble; //Concatenate escape sequence "\x" with top and bottom nibble
        return $escaped;
    }

    function UnescapeNonASCII($string) //Convert string to hex, replace escaped hex with actual hex.
    {
        $stringtohex = bin2hex($string);
        $stringtohex = preg_replace_callback('/5c5c78([a-fA-F0-9]{4})/', function ($m) { 
            return hex2bin($m[1]);
        }, $stringtohex);
        return hex2bin(strtoupper($stringtohex));
    }
?>
查看更多
忆尘夕之涩
7楼-- · 2019-01-01 08:32

All of the solutions work partially, and even below probably does not cover all of the cases. My issue was in trying to insert a string into a utf8 mysql table. The string (and its bytes) all conformed to utf8, but had several bad sequences. I assume that most of them were control or formatting.

function clean_string($string) {
  $s = trim($string);
  $s = iconv("UTF-8", "UTF-8//IGNORE", $s); // drop all non utf-8 characters

  // this is some bad utf-8 byte sequence that makes mysql complain - control and formatting i think
  $s = preg_replace('/(?>[\x00-\x1F]|\xC2[\x80-\x9F]|\xE2[\x80-\x8F]{2}|\xE2\x80[\xA4-\xA8]|\xE2\x81[\x9F-\xAF])/', ' ', $s);

  $s = preg_replace('/\s+/', ' ', $s); // reduce all multiple whitespace to a single space

  return $s;
}

To further exacerbate the problem is the table vs. server vs. connection vs. rendering of the content, as talked about a little here

查看更多
登录 后发表回答