I am trying to get search results from yahoo.com.
But file_get_contents() seems to convert the UTF-8 content (the charset that Yahoo uses) to ISO-8859-1.
Try:
$filename = "http://search.yahoo.com/search;_ylt=A0oG7lpgGp9NTSYAiQBXNyoA?p=naj%C5%A1%C5%A5astnej%C5%A1%C3%AD&fr2=sb-top&fr=yfp-t-701&type_param=&rd=pref";
echo file_get_contents($filename);
Attempts such as
header('Content-Type: text/html; charset=UTF-8');
or
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
or
$er = mb_convert_encoding($filename , 'UTF-8');
or
$s2 = iconv("ISO-8859-1","UTF-8",$filename );
or
echo utf8_encode(file_get_contents($filename));
do NOT help: after fetching the web content, special characters such as š, ť and ž are replaced with question marks (???).
I would appreciate any kind of help.
file_get_contents should not change the charset. The data is pulled in as a binary string.
When checking out the url you provided, this is the header it provides:
Also, in the body:
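To repeat this check yourself, a minimal sketch along these lines could pull the declared charset out of the response headers and out of the body's meta tag. The helper names charset_from_headers and charset_from_body are made up for illustration, and the regular expressions only cover the common forms of these declarations.

```php
<?php
// Hypothetical helper: find the charset declared in a list of raw header lines.
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset=\s*"?([\w.:-]+)/i', $line, $m)) {
            return strtolower($m[1]);
        }
    }
    return null; // no charset parameter found
}

// Hypothetical helper: find the charset declared in a <meta> tag of the body.
function charset_from_body(string $html): ?string
{
    if (preg_match('/<meta[^>]+charset=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null; // no meta charset declaration found
}

// After file_get_contents() on an http:// URL, PHP fills
// $http_response_header in the calling scope, so usage could look like:
//   $body = file_get_contents($filename);
//   echo charset_from_headers($http_response_header), "\n";
//   echo charset_from_body($body), "\n";
```

If both report utf-8 but the output still shows question marks, the damage is more likely happening on the way out (for example, in how the page is echoed) than inside file_get_contents.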
Also, you can't losslessly convert UTF-8 to ISO-8859-1 and get the characters back when converting back to UTF-8. UTF-8 (Unicode) supports many more characters than ISO-8859-1, so any character outside ISO-8859-1 is lost in the first step.
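The loss can be seen with a small round trip. This is only a sketch: it assumes the mbstring extension is loaded with its default substitution character, the function name round_trip_latin1 is made up for illustration, and the test word is the search term taken from the URL in the question.

```php
<?php
// Hypothetical helper: convert UTF-8 text to ISO-8859-1 and back again.
// Requires the mbstring extension.
function round_trip_latin1(string $utf8): string
{
    // Characters with no ISO-8859-1 equivalent are substituted here...
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    // ...so converting back cannot restore them.
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

// The search term from the question's URL, decoded to raw UTF-8 bytes.
$word = rawurldecode('naj%C5%A1%C5%A5astnej%C5%A1%C3%AD');

$result = round_trip_latin1($word);
echo $result, "\n";
var_dump($result === $word); // false: š and ť did not survive
```

Only í comes back intact, because it is the one accented character in the word that also exists in ISO-8859-1.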
In the browser this is not the case, so perhaps you just need to send a correct Accept-Charset header to tell Yahoo's system that you can accept UTF-8. (Accept-Encoding negotiates compression such as gzip, not the character set.)
For anyone investigating this:
The time I spent on encoding issues taught me that PHP functions rarely change the encoding of strings "magically". (One of these rare examples is :
Please note also that the working header set is as follows:
and not:
As I had a similar issue as the one you describe, it was enough to set the headers properly.
Hope this helps!
This seems to be a content negotiation problem, as file_get_contents probably sends a request that only accepts ISO 8859-1 as character encoding. You can create a custom stream context for file_get_contents using stream_context_create that explicitly states that you accept UTF-8:
Better solution...
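The custom stream context suggested in this answer might look roughly like the sketch below. The function name utf8_context_options is made up for illustration, and whether the server actually honors Accept-Charset is an assumption: many servers ignore this header and always send the same encoding.

```php
<?php
// Hypothetical helper: context options whose request header asks for UTF-8.
function utf8_context_options(): array
{
    return [
        'http' => [
            'method' => 'GET',
            // Tell the server which character set we are willing to accept.
            'header' => "Accept-Charset: UTF-8\r\n",
        ],
    ];
}

$context = stream_context_create(utf8_context_options());

// $filename is the search URL from the question. The request itself is
// left commented out so the sketch runs without network access:
//   $html = file_get_contents($filename, false, $context);
//   header('Content-Type: text/html; charset=UTF-8');
//   echo $html;
```

The third argument of file_get_contents takes the context; the second (use_include_path) is left at false. Sending your own Content-Type header before echoing keeps the browser from guessing a different charset for the output.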