I have a PHP-powered RSS feed caching system. If a feed contains certain characters, e.g. curly quotes/apostrophes, these come back garbled in the cURL response.
Example feed: http://www.theguardian.com/football/hullcity/rss (note curly apostrophes)
cURL code:
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => 1,            // return the response body instead of printing it
        CURLOPT_TIMEOUT => CURL_CONNECT_TIMEOUT // app-defined constant (definition not shown)
    ));
Resultant data (extract):
Sergio Agüero is firing again, José Mourinho’s propaganda ...
Is there some cURL option I should be configuring, or do I have no choice but to string-handle these out after cURL has finished?
I know there's a cURL option, CURLOPT_ENCODING, but to my knowledge that's about encoding data sent, not retrieved.
Dealing with encoding in feeds is hard. You first have to identify which encoding the text of the feed uses, and then convert it to whatever encoding you want to display it in.
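To determine the encoding, you have to look in two different places: the charset parameter of the HTTP Content-Type header, and the encoding attribute of the XML declaration at the top of the feed.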
Feedparser's documentation is the most explicit on how to deal with this. You could also use services like Superfeedr which will handle the conversion to UTF-8 for you!
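As a rough illustration of the identify-then-convert approach, here is a minimal PHP sketch. The function names (detect_feed_charset, feed_to_utf8, fetch_feed_utf8) and the 10-second default timeout are made up for this example. It assumes the mbstring extension is available, that the HTTP header takes priority over the XML declaration, and that UTF-8 is a sensible fallback when neither names a charset. It does not cover every case Feedparser handles (BOM sniffing, for instance).

```php
<?php
// Hypothetical helper: find the feed's declared charset.
// Looks at the HTTP Content-Type header first, then at the XML declaration,
// and falls back to UTF-8 (an assumption, not a guarantee) if neither names one.
function detect_feed_charset(?string $contentType, string $body): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    if (preg_match('/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([\w.:-]+)["\']/i', $body, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8';
}

// Hypothetical helper: convert the body to UTF-8 if it is declared as something else.
// Note: mb_convert_encoding() rejects charset names that mbstring does not know.
function feed_to_utf8(string $body, string $charset): string
{
    if ($charset === 'UTF-8') {
        return $body;
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// Hypothetical wrapper tying the two together with cURL.
function fetch_feed_utf8(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => $timeout,
    ]);
    $body = curl_exec($ch);
    // Content-Type as sent by the server, e.g. "text/xml; charset=ISO-8859-1"
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    if ($body === false) {
        return null; // request failed
    }
    return feed_to_utf8($body, detect_feed_charset($contentType ?: null, $body));
}
```

Calling fetch_feed_utf8($url) should then hand back a UTF-8 string to cache. Whatever page later displays the cached text still has to declare UTF-8 itself, otherwise the characters can look broken again even though the stored bytes are fine.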