Despite utf8 encoding some characters fail to be r

Posted 2019-08-08 04:03

Question:

I'm trying to scrape an RSS with a news title like this:

<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;</title>

This is effectively how I'm using Beautiful Soup to scrape it:

soup = BeautifulSoup(xml, 'xml')
start = soup.findAll('item')
for i in start:
    news, is_created = News.create_or_update(
        news_id,
        head_line=i.title.text.encode('utf-8').strip(),
        ...)

However, despite this effort, the title remains like this:

Photo of iceberg that is believed to have sunk Titanic sold at auction for \xa321,000 alongside &#039;world&#039;s most valuable biscuit&#039;

Would it be easier just to convert these special characters into ASCII characters?

Answer 1:

For the example you provide, this works for me:

from bs4 import BeautifulSoup
import html

xml='<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;</title>'
soup = BeautifulSoup(xml, 'lxml')
print(html.unescape(soup.get_text()))

html.unescape handles the HTML entities. If Beautiful Soup is not handling the pound sign correctly, you may need to specify the encoding when creating the BeautifulSoup object, e.g.

soup = BeautifulSoup(xml, "lxml", from_encoding='latin-1')


Answer 2:

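I believe I have finally found the problem. The characters above are escaped HTML inside XML: the feed contains &amp;#039;, which the XML parser decodes to the literal text &#039; rather than to an apostrophe. What a mess. If you look at The Independent's RSS feed, most titles are affected like that.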
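To make the two layers of escaping visible, here is a minimal sketch using only the standard library (Python 3). It uses xml.etree.ElementTree in place of Beautiful Soup purely so the example is self-contained; the shortened title string is an illustrative assumption, not the full feed item.

```python
import html
import xml.etree.ElementTree as ET

# Shortened, illustrative version of the feed's <title> element.
raw = ("<title>sold at auction for £21,000 alongside "
       "&amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;</title>")

# Layer 1: the XML parser decodes &amp; to &, which leaves the literal
# text &#039; behind in the element's text.
once = ET.fromstring(raw).text

# Layer 2: html.unescape resolves the remaining HTML entity to an apostrophe.
twice = html.unescape(once)

print(once)
print(twice)
```

The pound sign passes through both steps untouched; only the apostrophes needed the second pass.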

So this is not a UTF-8 problem. The real question is: how can I unescape the HTML entities in my title above before encoding it as UTF-8?

head_line=i.title.text.encode('utf-8').strip(),

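I solved it by unescaping the title with HTMLParser and then encoding it as UTF-8. Marco's answer does essentially the same, but the html library didn't work for me (html.unescape is only available from Python 3.4 onward, whereas HTMLParser is the Python 2 module name).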

head_line=HTMLParser.HTMLParser().unescape(i.title.text).encode('utf-8').strip(),
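For code that may run on either Python version, one possible approach is to pick the unescape function at import time. This is only a sketch: clean_headline is a hypothetical helper name introduced here for illustration, not something from the original code.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module name
    unescape = HTMLParser().unescape


def clean_headline(text):
    """Unescape HTML entities, trim whitespace, and return UTF-8 bytes.

    Hypothetical helper mirroring the one-liner above.
    """
    return unescape(text).strip().encode('utf-8')


headline = clean_headline(u"  \u00a321,000 &#039;biscuit&#039;  ")
print(headline)
```

Stripping before encoding keeps the whitespace handling on the text side, so the result is the same bytes on both Python versions.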

I don't recommend using from_encoding='latin-1', as it causes other problems. Unescaping followed by encode('utf-8') is enough. The \xa3 in the output is not corruption: it is how Python's repr shows the code point U+00A3 (£), which UTF-8 encodes as the two bytes \xc2\xa3.
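The difference between the code point and its encoded bytes can be checked directly. The sketch below assumes Python 3 (it relies on the built-in ascii()), and its last step shows one kind of problem that can appear when UTF-8 bytes are read as latin-1.

```python
pound = "\u00a3"  # the pound sign, code point U+00A3

# ascii() shows the escaped form of the text, the same \xa3 seen in the question.
escaped = ascii(pound)

# Encoding to UTF-8 produces two bytes, not one.
utf8_bytes = pound.encode("utf-8")

# Decoding those UTF-8 bytes with the wrong codec yields mojibake.
misread = utf8_bytes.decode("latin-1")

print(escaped, utf8_bytes, misread)
```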