I'm working on a web crawler that grabs data from sites all over the world, so it has to deal with many different languages and encodings.
Currently I'm using the following function, and it works in 99% of the cases. But there is this 1% that is giving me headaches.
function convertEncoding($str) {
    return iconv(mb_detect_encoding($str), "UTF-8", $str);
}
It's not possible to detect the character set of a string with 100% accuracy, since some character sets are subsets of others. Set the character set explicitly whenever possible, and avoid mixing iconv and mbstring functions. I recommend using a function like this and supplying the source charset whenever possible:
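The function from this answer is not preserved here, so what follows is only a minimal sketch of the idea, not the answerer's original code. The name convertToUtf8, the $fromCharset parameter, and the candidate list passed to mb_detect_encoding() are all assumptions made for illustration. It uses mbstring only, in line with the advice not to mix iconv and mbstring.

```php
<?php
// Hypothetical helper: convert $str to UTF-8, preferring a charset supplied by the caller.
function convertToUtf8(string $str, ?string $fromCharset = null): string
{
    if ($fromCharset === null) {
        // No charset supplied: fall back to strict detection over a short
        // candidate list (an assumption; adjust it to the sites you crawl).
        $detected = mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1'], true);
        // If detection fails, assume the input is already UTF-8.
        $fromCharset = $detected !== false ? $detected : 'UTF-8';
    }

    return mb_convert_encoding($str, 'UTF-8', $fromCharset);
}
```

A call such as convertToUtf8($body, 'ISO-8859-1') skips detection entirely, which is the case this answer recommends aiming for. A charset name taken from a crawled page may be invalid, so it is probably worth validating it before passing it in.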
Rather than blindly trying to detect the encoding, you should first check whether the page you downloaded declares a character set. The character set may be set in the HTTP response header, for example:
Content-Type: text/html; charset=utf-8
Or in the HTML as a meta tag, for example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Only if neither is available should you try to guess the encoding with mb_detect_encoding() or other methods.
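The order described above (HTTP header first, then meta tag, then guessing) could be sketched roughly as follows. The function names findDeclaredCharset and pageToUtf8 are hypothetical, the regular expressions are a simplification rather than a full HTML parser, and the fallback candidate list is an assumption.

```php
<?php
// Hypothetical helper: return the declared charset, or null if the page declares none.
function findDeclaredCharset(?string $contentTypeHeader, string $html): ?string
{
    // 1. HTTP header value, e.g. "text/html; charset=ISO-8859-1"
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }

    // 2. <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="text/html; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    return null;
}

// Hypothetical helper: convert a downloaded page to UTF-8.
function pageToUtf8(?string $contentTypeHeader, string $html): string
{
    $charset = findDeclaredCharset($contentTypeHeader, $html);

    if ($charset === null) {
        // 3. Nothing declared: guess, and assume UTF-8 if the guess fails.
        $detected = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1'], true);
        $charset = $detected !== false ? $detected : 'UTF-8';
    }

    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```

Here $contentTypeHeader is assumed to be the raw value of the Content-Type response header as returned by whatever HTTP client the crawler uses, or null when the header is absent.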
You can try utf8_encode($str), which converts ISO-8859-1 text to UTF-8.
http://www.php.net/manual/en/function.utf8-encode.php#89789
Or you can replace the content type meta tag with
from header of crawled content
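One caveat with utf8_encode() is that it always treats its input as ISO-8859-1, so applying it to text that is already UTF-8 double-encodes it; it is also deprecated as of PHP 8.2. A minimal sketch of a guarded equivalent using mbstring is shown below. The name latin1ToUtf8 is hypothetical, and the check assumes that input which happens to be valid UTF-8 really is UTF-8.

```php
<?php
// Hypothetical helper: ISO-8859-1 to UTF-8, leaving valid UTF-8 input untouched.
function latin1ToUtf8(string $str): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        // Already valid UTF-8; converting again would double-encode it.
        return $str;
    }

    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}
```

This only helps when the non-UTF-8 pages really are ISO-8859-1; for any other charset the declared or explicitly supplied encoding is still needed.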