I have a feed pulled from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.
If the same function is applied twice by mistake, or the wrong one is used, the output gets even uglier. This is what I want to fix.
How can I detect which of the two needs to be applied to a given string?
UPDATE
Actually, the content is reported as UTF-8, but some parts inside it are not.
I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.
The most reliable check I have found, which has worked in every case so far, is:
if (preg_match('!!u', $string)) {
    // this is utf-8
} else {
    // definitely not utf-8
}
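For the mixed content described in the question's update, the same check could be applied chunk by chunk. This is only a minimal sketch: the helper name `ensure_utf8` is hypothetical, and it assumes that whatever is not valid UTF-8 is ISO-8859-1, which may not hold for your feed.

```php
<?php
// Hypothetical helper: returns $chunk unchanged when it is already valid
// UTF-8, otherwise converts it from an ASSUMED fallback encoding.
function ensure_utf8(string $chunk, string $fallback = 'ISO-8859-1'): string
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8
    // and 1 when the empty pattern matches a valid subject.
    if (preg_match('//u', $chunk) === 1) {
        return $chunk;
    }
    return mb_convert_encoding($chunk, 'UTF-8', $fallback);
}

$utf8   = "Chr\u{E9}tiens"; // "Chrétiens" as UTF-8 bytes
$latin1 = "Chr\xE9tiens";   // the same word as ISO-8859-1 bytes

var_dump(ensure_utf8($utf8) === $utf8);   // bool(true): left untouched
var_dump(ensure_utf8($latin1) === $utf8); // bool(true): converted once
```

Because valid UTF-8 is returned as-is, calling the helper a second time on its own output changes nothing, which avoids the "applied twice" problem.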
You can use mb_detect_encoding (Detect character encoding).
The charset might also be available in the HTTP response headers or in the response data itself.
Example:
var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);
Output (codepad):
string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}
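To actually use the charset from the headers, something like the following could pull it out of the Content-Type line. It is a rough sketch: `charset_from_headers` is a made-up name, and the sample header array is hard-coded here in place of a real `$http_response_header`.

```php
<?php
// Hypothetical helper: scans raw header lines (the format used by
// $http_response_header) for a Content-Type charset parameter.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([^";\s]+)/i', $line, $m)) {
            // Keep the last match so that a later header line wins.
            $charset = strtoupper($m[1]);
        }
    }
    return $charset; // null when no charset was announced
}

// Sample data standing in for $http_response_header.
$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Connection: close',
];

var_dump(charset_from_headers($headers)); // string(5) "UTF-8"
```

A null result means the server did not announce a charset, so you would have to fall back to inspecting the body.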
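function str_to_utf8($str) {
    // Undo one round of UTF-8 encoding (note: utf8_decode() is deprecated as of PHP 8.2).
    $decoded = utf8_decode($str);
    // If the result is no longer valid UTF-8, the input was not double-encoded: keep it.
    if (mb_detect_encoding($decoded, 'UTF-8', true) === false) {
        return $str;
    }
    return $decoded;
}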
var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("Â« ChrÃ©tiens d'Orient Â» : la RATP fait marche arriÃ¨re")); // double-encoded input
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
The feed (I guess you mean some kind of XML-based feed) should have an attribute in its header telling you what the encoding is. If it doesn't, you are out of luck, as there is no reliable way to identify the encoding.
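As an illustration of reading that attribute, here is a small sketch that extracts the encoding from the XML declaration and converts the feed to UTF-8. The function name `xml_declared_encoding` and the sample feed are invented for the example, and falling back to UTF-8 when nothing is declared is an assumption based on XML's default.

```php
<?php
// Hypothetical helper: reads the encoding pseudo-attribute from the XML
// declaration at the start of a feed, or returns null if none is declared.
function xml_declared_encoding(string $xml): ?string
{
    $pattern = '/^\s*<\?xml\b[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    return preg_match($pattern, $xml, $m) === 1 ? strtoupper($m[1]) : null;
}

// Invented sample feed whose title is stored as ISO-8859-1 bytes.
$feed = '<?xml version="1.0" encoding="ISO-8859-1"?><rss><title>'
      . "Chr\xE9tiens" . '</title></rss>';

// Assumption: treat a missing declaration as UTF-8.
$encoding = xml_declared_encoding($feed) ?? 'UTF-8';
$utf8     = mb_convert_encoding($feed, 'UTF-8', $encoding);

var_dump($encoding);                         // string(10) "ISO-8859-1"
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
```

Note that after the conversion the declaration inside the string still names the old encoding, so it should be updated before the result is handed to an XML parser.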
Encoding auto-detection is not bulletproof, but you can try mb_detect_encoding(). See also mb_check_encoding().
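A quick sketch of the two functions side by side, using made-up sample strings. The candidate list passed to mb_detect_encoding() is an assumption about which encodings the feed might contain, and the third argument turns on strict mode.

```php
<?php
// Assumed candidate encodings; adjust to what the feed can plausibly contain.
$candidates = ['UTF-8', 'ISO-8859-1'];

$latin1 = "arri\xE8re";   // "arrière" as ISO-8859-1 bytes
$utf8   = "arri\u{E8}re"; // "arrière" as UTF-8 bytes

// mb_check_encoding() answers a yes/no question about one encoding.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)

// mb_detect_encoding() picks a candidate (or returns false); strict mode
// rejects candidates for which the string is not valid.
var_dump(mb_detect_encoding($latin1, $candidates, true));
var_dump(mb_detect_encoding($utf8, $candidates, true));
```

The yes/no check is usually the safer building block, since the detection result is only a guess among the candidates you supply.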