Simple HTML DOM character encoding issue

Posted 2020-02-11 06:25

Hey guys, I'm using Simple HTML DOM to retrieve content from another website, but there's a character encoding issue with the content it retrieves. Some characters show up as the little diamond with a question mark inside (�).

The character encoding issue only happens with the content retrieved, and all other text on my site is displaying fine.

If anyone could help that would be great.

3 answers
迷人小祖宗
#2 · 2020-02-11 07:00

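Go to the website and check its charset by viewing the page info, then convert the scraped text from that charset to UTF-8. If you don't know the charset, mb_detect_encoding can guess it, though the guess is not always right: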

$text = iconv(mb_detect_encoding($text), "UTF-8//TRANSLIT//IGNORE", $text);
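
If you would rather read the charset the page declares than guess it, something like the following might work. This is only a rough sketch: detect_declared_charset is a made-up helper name, the regex only covers the common `<meta>` forms, and it ignores the HTTP Content-Type header, which can also carry the charset.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): return the charset
// declared in a <meta> tag of the raw HTML, or null if none is found.
function detect_declared_charset(string $html): ?string
{
    // Covers <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Sample page declared as ISO-8859-1; \xE9 is "é" in that charset.
$html = '<html><head><meta charset="iso-8859-1"></head><body>caf' . "\xE9" . '</body></html>';

// Fall back to UTF-8 (an assumption) when the page declares nothing.
$charset = detect_declared_charset($html) ?? 'UTF-8';
$utf8 = iconv($charset, 'UTF-8//IGNORE', $html);
```

You can then pass $utf8 to str_get_html as usual.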
Emotional °昔
#3 · 2020-02-11 07:09

Try using iconv to convert the charset of the scraped text to the charset you use on your page.

Signature:

string iconv ( string $in_charset , string $out_charset , string $str )

Example:

echo iconv("ISO-8859-1", "UTF-8", $text);
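
One thing to watch for: running iconv on text that is already UTF-8 garbles it a second time. A small guard like the one below might help. It's a sketch only; to_utf8 is a made-up name, and the ISO-8859-1 default is just an assumption about the source site.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not
// already valid UTF-8, so calling it twice does no harm.
function to_utf8(string $text, string $sourceCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (plain ASCII lands here too)
    }
    $converted = iconv($sourceCharset, 'UTF-8//IGNORE', $text);
    // iconv returns false on failure; fall back to the original bytes.
    return $converted === false ? $text : $converted;
}

$latin1 = "caf\xE9";                 // "café" encoded as ISO-8859-1
echo to_utf8($latin1), "\n";         // converted once
echo to_utf8(to_utf8($latin1)), "\n"; // second call leaves it unchanged
```

The check is not foolproof, because some byte sequences are valid in both encodings, so knowing the real source charset is still the safer route.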
家丑人穷心不美
#4 · 2020-02-11 07:17

I had this problem too, but it was not a charset problem. It was gzip compression, which Simple HTML DOM doesn't handle. Here is my solution: use the function file_get_html2 instead of file_get_html.

function curl($url){
    // Browser-like request headers.
    $headers    = array();
    $headers[]  = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
    $headers[]  = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $headers[]  = "Accept-Language:en-us,en;q=0.5";
    $headers[]  = "Accept-Encoding:gzip,deflate";
    $headers[]  = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $headers[]  = "Keep-Alive:115";
    $headers[]  = "Connection:keep-alive";
    $headers[]  = "Cache-Control:max-age=0";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    // Have cURL decompress the compressed response body.
    curl_setopt($curl, CURLOPT_ENCODING, "gzip");
    // Return the body as a string instead of printing it.
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    // Follow HTTP redirects.
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;
}
function file_get_html2($url){
    return str_get_html(curl($url));
}
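
If you already have a response body fetched some other way and are not sure whether it is still compressed, you could check for the gzip magic bytes before parsing. This is a sketch under assumptions: maybe_gunzip is a made-up name, it needs the zlib extension, and it only recognises gzip, not raw deflate.

```php
<?php
// Hypothetical helper: decompress the body only if it starts with the
// gzip magic bytes 0x1f 0x8b; otherwise return it unchanged.
function maybe_gunzip(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}

$compressed = gzencode('<p>hello</p>');   // simulate a gzip-encoded response
echo maybe_gunzip($compressed), "\n";     // decompressed HTML
echo maybe_gunzip('<p>plain</p>'), "\n";  // plain HTML passes through
```

The result can be handed to str_get_html the same way file_get_html2 does.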