How to detect if have to apply utf8 decode or enco

2019-01-08 14:46发布

问题:

I have a feed taken from 3rd party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.

If by mistake the same stuff is applied twice/or the wrong method is used I get something more ugly, this is what I want to change.

How can I detect when what have to apply on the string?

UPDATE

Actually the content returns UTF-8, but inside there are parts that are not.

回答1:

I can't say I can rely on mb_detect_encoding(). Had some freaky false positives a while back.

The most universal way I found to work well in every case was:

if (preg_match('!!u', $string))
{
   // this is utf-8
}
else 
{
   // definitely not utf-8
}


回答2:

You can use

  • mb_detect_encoding — Detect character encoding

The charset might also be available in the HTTP Response Headers or in the Response data itself.

Example:

var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);

Output (codepad):

string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}


回答3:

function str_to_utf8 ($str) {
    $decoded = utf8_decode($str);
    if (mb_detect_encoding($decoded , 'UTF-8', true) === false)
        return $str;
    return $decoded;
}

var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)


回答4:

The feed (I guess you mean some kind of xml based feed) should have an attribute in the header telling you what the encoding is. If not, you are out of luck as you don't have a reliable means of identifying the encoding.



回答5:

Encoding autotection is not bullet-proof but you can try mb_detect_encoding(). See also mb_check_encoding().