I am trying to read a Google Alerts feed via PHP using something like:
$feed = file_get_contents("http://www.google.com/alerts/feeds/01445174399729103044/950192755411504138");
Regardless of whether I save $feed to a file or echo the result to the output, all UTF-8 Unicode characters (i.e. those with diacritics) are replaced by white space. I have tried - without success - various combinations of:
utf8_encode
utf8_decode
iconv
mb_convert_encoding
I think the wrong characters have come from the stream, but I'm lost because if I try this URI in a browser then everything is fine. Can anyone shed some light on the issue?
Sorry, you are absolutely correct - there is something untoward happening! Though it is not what you would first suspect... For reference, given that:
The Unicode data is lost before it is even sent by the remote server. It appears that Google looks at the User-Agent string in the request header, which file_get_contents does not send by default when no stream context is supplied. Because it cannot identify the client making the request, it defaults to and forces ASCII encoding. This is presumably a necessary fallback in the event of some kind of cataclysmic cock-up. [citation needed...]
Simply naming your application is not enough, however; you need to include a known vendor. I'm unsure of the full extent of this, but I believe most folks include "Mozilla [version]" to work around the issue, for example:
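A minimal sketch of that workaround, assuming the missing User-Agent really is the cause. The helper names build_feed_context and fetch_feed, and the user-agent value itself, are made up for illustration; the only real API used is stream_context_create with the 'user_agent' HTTP context option, passed as the third argument to file_get_contents.

```php
<?php
// Hypothetical helper: builds a stream context that sends a browser-like
// User-Agent header. The default string is an illustrative assumption,
// not a value Google is known to require.
function build_feed_context(string $userAgent = 'Mozilla/5.0 (compatible; MyFeedReader/1.0)')
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
        ],
    ]);
}

// Hypothetical helper: fetches a URL using the context above.
// Returns the response body as a string, or false on failure.
function fetch_feed(string $url)
{
    return file_get_contents($url, false, build_feed_context());
}
```

Calling fetch_feed() with the feed URL from the question in place of the bare file_get_contents() call should then send the header with the request. If you would rather not pass a context everywhere, setting the user_agent ini directive with ini_set('user_agent', ...) before the call is an alternative that may suit small scripts.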