Simple HTML DOM character encoding issue

Posted 2020-02-11 06:35

Question:

Hey guys, I'm using Simple HTML DOM to retrieve content from another website, but there's a character encoding issue with the content it retrieves. Some characters show up as a little diamond with a question mark inside.

The character encoding issue only happens with the content retrieved, and all other text on my site is displaying fine.

If anyone could help that would be great.

Answer 1:

Try using iconv to convert the charset of the scraped text to the charset you use on your page.

Signature:

string iconv ( string $in_charset , string $out_charset , string $str )

Example:

echo iconv("ISO-8859-1", "UTF-8", $text);
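
One thing to watch with the call above: if the scraped text is already UTF-8, converting it again from ISO-8859-1 will garble it. A minimal sketch of a guard, assuming the mbstring extension is loaded; the helper name to_utf8 and the ISO-8859-1 default are made up for illustration, so use whatever charset the scraped site actually declares:

```php
<?php
// Hypothetical helper: convert scraped text to UTF-8 unless it is already valid UTF-8.
// $sourceCharset is an assumption; set it to the charset of the site you scrape.
function to_utf8($text, $sourceCharset = "ISO-8859-1") {
    if (mb_check_encoding($text, "UTF-8")) {
        return $text; // already valid UTF-8, converting again would garble it
    }
    return iconv($sourceCharset, "UTF-8", $text);
}

echo to_utf8("caf\xE9"), "\n";      // ISO-8859-1 bytes get converted
echo to_utf8("caf\xC3\xA9"), "\n";  // valid UTF-8 is passed through unchanged
```

You would then echo to_utf8($element->plaintext) or similar instead of calling iconv directly.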


Answer 2:

I had this problem too, but it was not a charset problem. It was gzip compression, which Simple HTML DOM doesn't handle. Here is my solution: use the function file_get_html2 instead of file_get_html.

// Fetch a URL with cURL and let cURL decompress the response body.
function curl($url){
    // Browser-like request headers.
    $headers    = array();
    $headers[]  = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
    $headers[]  = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $headers[]  = "Accept-Language:en-us,en;q=0.5";
    $headers[]  = "Accept-Encoding:gzip,deflate";
    $headers[]  = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $headers[]  = "Keep-Alive:115";
    $headers[]  = "Connection:keep-alive";
    $headers[]  = "Cache-Control:max-age=0";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    // This option makes cURL decode the compressed body before returning it.
    curl_setopt($curl, CURLOPT_ENCODING, "gzip");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($curl); // false if the request failed
    curl_close($curl);
    return $data;
}

// Replacement for file_get_html() that fetches the page through curl().
function file_get_html2($url){
    $data = curl($url);
    return $data === false ? false : str_get_html($data);
}
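
If you get the page body some other way and are not sure whether it is still compressed, you can check for the gzip magic bytes before parsing. This is only a rough sketch, assuming the zlib extension is available; maybe_gunzip is a made-up helper name, not part of Simple HTML DOM:

```php
<?php
// Hypothetical helper: decompress the body only if it starts with the gzip magic bytes.
function maybe_gunzip($body) {
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body; // not gzip, or decoding failed: hand back the input untouched
}

// Local demonstration without any network access.
$html = "<p>hello</p>";
var_dump(maybe_gunzip(gzencode($html)) === $html); // compressed input is unpacked
var_dump(maybe_gunzip($html) === $html);           // plain input is returned as-is
```

The result can then go straight into str_get_html(maybe_gunzip($raw)), where $raw is whatever you downloaded.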


Answer 3:

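Go to the website and check its charset by viewing the page info in your browser. You can also let PHP guess the source encoding and convert the text to UTF-8: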

$text = iconv(mb_detect_encoding($text), "UTF-8//TRANSLIT//IGNORE", $text);
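
Note that mb_detect_encoding can return false, and its guess depends on which encodings it is allowed to consider. A slightly more defensive sketch, assuming mbstring is loaded; the helper name guess_and_convert, the candidate list, and the fallback charset are all assumptions you would adjust to the sites you scrape:

```php
<?php
// Hypothetical helper: guess the encoding from a short candidate list, then convert to UTF-8.
// The candidates and the fallback are assumptions, not something the library prescribes.
function guess_and_convert($text, $candidates = array("UTF-8", "ISO-8859-1"), $fallback = "ISO-8859-1") {
    // Strict mode rejects candidates for which the bytes are not valid.
    $from = mb_detect_encoding($text, $candidates, true);
    if ($from === false) {
        $from = $fallback; // detection failed, fall back to the assumed charset
    }
    if ($from === "UTF-8") {
        return $text; // nothing to convert
    }
    return iconv($from, "UTF-8", $text);
}

echo guess_and_convert("caf\xE9"), "\n"; // bytes that are not valid UTF-8 get converted
```

Detection is still a guess, so if the site declares its charset in a header or meta tag, passing that charset to iconv directly is the safer route.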