How to remove multiple UTF-8 BOM sequences before

Posted 2018-12-31 23:15

Question:

I'm using PHP 5 (CGI) to output template files from the filesystem and am having trouble emitting raw HTML.

private function fetch($name) {
    $path = $this->j->config['template_path'] . $name . '.html';
    if (!file_exists($path)) {
        dbgerror('Could not find the template "' . $name . '" in ' . $path);
    }
    $f = fopen($path, 'r');
    $t = fread($f, filesize($path));
    fclose($f);
    if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
        $t = substr($t, 3);
    }
    return $t;
}

Even though I've added the BOM fix, I'm still having problems with Firefox accepting the output. You can see a live copy here: http://ircb.in/jisti/ (the template file is at http://ircb.in/jisti/home.html if you want to check it out).

Any idea how to fix this? o_o

Answer 1:

You can use the following code to remove a single leading UTF-8 BOM:

// Remove UTF-8 BOM
function remove_utf8_bom($text)
{
    $bom = pack('H*', 'EFBBBF');
    $text = preg_replace("/^$bom/", '', $text);
    return $text;
}


Answer 2:

Try:

// -------- read the file-content ----
$str = file_get_contents($source_file); 

// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF", '', $str);

// -------- get the Object from JSON ---- 
$obj = json_decode($str); 

:)



Answer 3:

Another way to remove the BOM, which is Unicode code point U+FEFF:

$str = preg_replace('/\x{FEFF}/u', '', $file);
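Two properties of that pattern are worth keeping in mind: without an anchor it removes U+FEFF anywhere in the string, and with the `u` modifier `preg_replace()` returns `null` if the input is not valid UTF-8. A minimal sketch of an anchored variant that also falls back to the input on invalid UTF-8 (the helper name `strip_leading_bom_u` is made up for illustration):

```php
<?php
// Hypothetical helper: removes one or more U+FEFF code points at the start only.
function strip_leading_bom_u($text)
{
    $result = preg_replace('/^\x{FEFF}+/u', '', $text);
    // With /u, preg_replace() returns null for invalid UTF-8; keep the input then.
    return $result === null ? $text : $result;
}

echo strip_leading_bom_u("\xEF\xBB\xBF\xEF\xBB\xBFabc"), PHP_EOL;
```

Whether falling back to the unchanged input is the right behavior depends on the caller; raising an error is an equally reasonable choice.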


Answer 4:

In single quotes, b'\xef\xbb\xbf' is the literal 12-character string \xef\xbb\xbf (backslashes included), not three bytes. If you want to check for a BOM, you need to use double quotes, so that the \x sequences are actually interpreted as bytes:

"\xef\xbb\xbf"
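The difference between the two quoting styles is easy to verify by comparing lengths and hex dumps. A small sketch (the variable names are illustrative, not taken from the question):

```php
<?php
// Single quotes: \x escapes are not interpreted, so this is 12 literal characters.
$single = '\xef\xbb\xbf';
// Double quotes: each \xNN escape becomes one byte, giving the 3-byte UTF-8 BOM.
$double = "\xef\xbb\xbf";

echo strlen($single), ' ', bin2hex($single), PHP_EOL; // 12 5c7865665c7862625c786266
echo strlen($double), ' ', bin2hex($double), PHP_EOL; // 3 efbbbf
```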

Your files also seem to contain a lot more garbage than just a single leading BOM:

$ curl http://ircb.in/jisti/ | xxd

0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef  ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068  .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561  tml>.<html>.<hea
...
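The dump shows several BOMs in a row, so a check that removes only the first three bytes leaves the rest in place. A minimal sketch that keeps stripping until no leading BOM remains (the function name `strip_leading_boms` is hypothetical):

```php
<?php
// Hypothetical helper: strips every BOM at the start of $text, not just the first.
function strip_leading_boms($text)
{
    $bom = "\xEF\xBB\xBF"; // double quotes, so these are three real bytes
    $len = strlen($bom);
    while (strncmp($text, $bom, $len) === 0) {
        // The cast guards against substr() returning false on older PHP versions.
        $text = (string) substr($text, $len);
    }
    return $text;
}

$raw = str_repeat("\xEF\xBB\xBF", 7) . "<!DOCTYPE html>\n";
echo strip_leading_boms($raw); // prints <!DOCTYPE html>
```

This only cleans the output. Re-saving the template files without a BOM removes the problem at its source.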


Answer 5:

This global function handles it on systems whose base charset is UTF-8. Thanks!

function prepareCharset($str) {

    // set default encode
    mb_internal_encoding('UTF-8');

    // pre filter
    if (empty($str)) {
        return $str;
    }

    // get charset
    $charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));

    if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
        $str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
    } else {
        $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    }

    // remove BOM
    $str = urldecode(str_replace("%C2%81", '', urlencode($str)));

    // prepare string
    return $str;
}


Answer 6:

An extra method to do the same job:

function remove_utf8_bom_head($text) {
    if (substr(bin2hex($text), 0, 6) === 'efbbbf') {
        $text = substr($text, 3);
    }
    return $text;
}

The other methods I found did not work in my case.

Hope it helps in some special case.



Answer 7:

If you are reading from an API using file_get_contents and get an inexplicable NULL from json_decode, check the value of json_last_error(). Sometimes the value returned from file_get_contents has an extraneous BOM that is almost invisible when you inspect the string, but that makes json_last_error() return JSON_ERROR_SYNTAX (4).

>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>

In this case, check the first 3 bytes. Echoing them is not very useful because the BOM is invisible in most terminals and editors:

>>> substr($json, 0, 3)
=> "  "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>

If the line above returns TRUE for you, then a simple test may fix the problem:

>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
     +"orgao": [
       {#203
         +"Nome": "Tribunal de Justiça",
         +"ID_Orgao": "59",
         +"Condicao": "1",
       },
     ],
     ...
   }
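The `$json[0] == "{"` test assumes the payload is a JSON object; a response that starts with `[` and has no BOM would lose its first three bytes. A sketch of a stricter variant that strips the BOM only when it is actually present and reports decode errors (the helper name `json_decode_without_bom` is an assumption for illustration):

```php
<?php
// Hypothetical helper: decodes JSON that may start with a UTF-8 BOM.
function json_decode_without_bom($json)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($json, $bom, 3) === 0) {
        $json = (string) substr($json, 3);
    }
    $data = json_decode($json, true); // true: decode objects as associative arrays
    if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('Invalid JSON, error code ' . json_last_error());
    }
    return $data;
}

$data = json_decode_without_bom("\xEF\xBB\xBF" . '[{"ID_Orgao":"59"}]');
echo $data[0]['ID_Orgao'], PHP_EOL; // prints 59
```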


Answer 8:

If you are importing a CSV file, the code below is useful:

$header = fgetcsv($handle);
foreach ($header as $key => $val) {
    $bom = pack('H*', 'EFBBBF');
    $val = preg_replace("/^$bom/", '', $val);
    $header[$key] = $val;
}
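Because a file's BOM sits at the very start of the file, only the first header cell can carry it, so the loop can be reduced to a single check. A self-contained sketch that uses an in-memory stream as stand-in data (the sample rows are invented for illustration):

```php
<?php
// Sample data held in memory; a real import would fopen() the uploaded file instead.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "\xEF\xBB\xBFid,name\n1,Ada\n");
rewind($handle);

// Delimiter, enclosure and escape are passed explicitly; adjust them to your file.
$header = fgetcsv($handle, 0, ',', '"', '\\');
if ($header !== false && isset($header[0])) {
    // Only the first cell can contain the file's leading BOM.
    $header[0] = preg_replace("/^\xEF\xBB\xBF/", '', $header[0]);
}
fclose($handle);

echo implode('|', $header), PHP_EOL; // prints id|name
```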


Answer 9:

This might help. Let me know if you would like me to expand on my thought process.

<?php
    //
    // labeled TESTINGSTRIPZ.php
    //

    define('CHARSET', 'UTF-8');

    $stringy = "\xef\xbb\xbf\"quoted text\" ";
    $str_find_array    = array("\xef\xbb\xbf");
    $str_replace_array = array('');


    $RESULT =
        trim(
            mb_convert_encoding(

                str_replace(
                    $str_find_array,
                    $str_replace_array,
                    strip_tags( $stringy )
                    ),

                'UTF-8',

                mb_detect_encoding(
                    strip_tags($stringy)
                    )

                )
            );

    print("YOUR RESULT IS: " . $RESULT . PHP_EOL);

?>

Result:

terminal$ php TESTINGSTRIPZ.php 
      YOUR RESULT IS: "quoted text" // < with no hidden char.