I scrape some sites that occasionally have UTF-8 characters in the title but don't declare UTF-8 as the charset (qq.com is an example). When I look at the site in my browser, the data I want to copy (i.e. the title) looks correct (Japanese or Chinese; I'm not sure which). I can copy the title and paste it into the terminal and it looks exactly the same. I can even write it to the DB, and when I retrieve it from the DB it still looks the same, and correct.
However, when I use cURL, the data that gets printed is wrong. Whether I run cURL from the command line or through PHP, the output is clearly incorrect when printed to the terminal, and it stays that way when I store it in the DB (remember: the terminal can display these characters properly). I've tried every eligible combination of the following:
- Setting `CURLOPT_BINARYTRANSFER` to `true`
- `mb_convert_encoding($html, 'UTF-8')`
- `utf8_encode($html)`
- `utf8_decode($html)`
None of these display the characters as expected. It's very frustrating that I can get the right characters so easily just by visiting the site, but cURL can't. I've read a lot of suggestions, such as this one: How to get web-page-title with CURL in PHP from web-sites of different CHARSET?
The general solution seems to be "convert the data to UTF-8." To be honest, I don't actually know what that means. Don't the functions above convert the data to UTF-8? Why isn't it already UTF-8? What is it, and why does it display properly in some circumstances, but not through cURL?
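For reference, the fetch itself is roughly this (a simplified sketch: the URL is just an example, and the title extraction shown is illustrative):

```php
<?php
// Simplified sketch of the fetch; the real URL varies (qq.com shown as
// an example) and the title extraction below is just illustrative.
$ch = curl_init('http://www.qq.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);  // one of the options I tried
$html = curl_exec($ch);
curl_close($ch);

// This is where the garbled characters show up
if (preg_match('/<title>(.*?)<\/title>/is', $html, $m)) {
    echo $m[1], "\n";
}
```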
Have you tried:

`$html = iconv("gb2312", "utf-8", $html);`

The `gb2312` was taken from the qq.com response headers.
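More generally, rather than hard-coding `gb2312`, you can read whatever charset the server (or the page's `<meta>` tag) declares and convert from that. Here's a minimal sketch along those lines; the `fetch_utf8` helper and the detection regexes are illustrative, not anything from a library:

```php
<?php
// Sketch: fetch a page and convert it to UTF-8 using whatever charset the
// server (or the page's <meta> tag) declares, instead of hard-coding gb2312.
function fetch_utf8(string $url): string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, true);           // keep headers so we can read Content-Type
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $response   = curl_exec($ch);
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    $headers = substr($response, 0, $headerSize);
    $body    = substr($response, $headerSize);

    // 1) Content-Type response header, e.g. "text/html; charset=GB2312"
    $charset = null;
    if (preg_match('/charset=([\w\-]+)/i', $headers, $m)) {
        $charset = $m[1];
    }
    // 2) Fall back to the <meta> declaration inside the document itself
    if ($charset === null && preg_match('/<meta[^>]+charset=["\']?([\w\-]+)/i', $body, $m)) {
        $charset = $m[1];
    }

    // Convert only if the page isn't already UTF-8
    if ($charset !== null && strcasecmp($charset, 'UTF-8') !== 0) {
        $body = iconv($charset, 'UTF-8//IGNORE', $body);
    }
    return $body;
}

$html = fetch_utf8('http://www.qq.com/');
```

The `//IGNORE` suffix tells iconv to drop any byte sequences it can't convert, so a single bad character doesn't fail the whole conversion.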