Encoding: certain characters coming back wrecked t

Posted 2019-08-30 09:09

Question:

I have a PHP-powered RSS feed caching system. If a feed contains certain characters, e.g. curly quotes/apostrophes, these are coming back in the cURL response wrecked.

Example feed: http://www.theguardian.com/football/hullcity/rss (note curly apostrophes)

cURL code:

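// CURL_CONNECT_TIMEOUT is a constant defined elsewhere in the application.
$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT => CURL_CONNECT_TIMEOUT
));
$response = curl_exec($ch); // raw feed body as returned by the server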

Resultant data (extract):

Sergio Agüero is firing again, José Mourinho’s propaganda ...

Is there some cURL option I should be configuring, or do I have no choice but to string-handle these out after cURL has finished?

I know there's a cURL option CURLOPT_ENCODING but to my knowledge that's about encoding data sent, not retrieved.

Answer 1:

Dealing with encoding in feeds is hard. You first have to identify which encoding the text of the feed uses, and then convert it to whatever encoding you want to display it in.

To determine the encoding, you have to look in two different places:

  • HTTP headers
  • XML declaration

Feedparser's documentation is the most explicit on how to deal with this. You could also use a service like Superfeedr, which will handle the conversion to UTF-8 for you!
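The two lookups described above could be sketched in PHP roughly as follows. This is a minimal illustration, not a complete feed parser: the function names detect_feed_charset and feed_to_utf8 are hypothetical, the mbstring extension is assumed to be installed, and the HTTP header is assumed to take precedence over the XML declaration (which is how RFC 3023 describes it for XML served over HTTP). If neither source names an encoding, the sketch falls back to UTF-8.

```php
<?php
/**
 * Hypothetical helper: choose a charset from the HTTP Content-Type header
 * first, then from the XML declaration, and fall back to UTF-8.
 */
function detect_feed_charset(?string $contentType, string $body): string
{
    // 1. HTTP header, e.g. "application/rss+xml; charset=ISO-8859-1"
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. XML declaration at the start of the body, e.g. encoding="utf-8"
    if (preg_match('/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([\w.:-]+)["\']/i', $body, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Neither source declared an encoding: assume UTF-8
    return 'UTF-8';
}

/**
 * Hypothetical helper: return the feed body converted to UTF-8.
 * mb_convert_encoding() rejects charset names it does not know,
 * so real code should be prepared to handle that failure.
 */
function feed_to_utf8(?string $contentType, string $body): string
{
    $charset = detect_feed_charset($contentType, $body);
    if ($charset === 'UTF-8') {
        return $body; // already in the target encoding
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}
```

With the cURL handle from the question, the header value could be read with curl_getinfo($ch, CURLINFO_CONTENT_TYPE) after curl_exec() and passed in as $contentType. Note that a feed which declares one encoding but is actually served in another will still come out wrong; the sketch trusts whatever the server declares.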