I'm trying to scrape an RSS feed with a news title like this:
<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &#039;world&#039;s most valuable biscuit&#039;</title>
This is effectively how I'm using Beautiful Soup to scrape it:
soup = BeautifulSoup(xml, 'xml')
start = soup.findAll('item')
for i in start:
    news, is_created = News.create_or_update(news_id,
        head_line=i.title.text.encode('utf-8').strip(),
        ...)
However, despite this effort, the title still looks like this:
Photo of iceberg that is believed to have sunk Titanic sold at auction for \xa321,000 alongside 'world's most valuable biscuit'
Would it be easier just to convert these special characters into ASCII characters?
For the example you provide, html.unescape works for me: it handles the HTML entities. If Beautiful Soup is not handling the pound sign correctly, you may need to specify the encoding when creating the BeautifulSoup object, e.g. with from_encoding.
I believe I have finally found the problem. The characters above are escaped HTML inside XML. What a mess. If you look at the Independent's RSS feed, most titles are affected like that.
So this is not a UTF-8 problem. How can I unescape the HTML entities in my title above before converting it to UTF-8?
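A minimal sketch of the unescaping step, assuming Python 3, where the standard library provides html.unescape (on Python 2 the equivalent was the unescape method of HTMLParser). The variable names are hypothetical, and the string is the title from the question as it might look after the XML parser has decoded the outer layer but left the HTML entities in place:

```python
import html

# Title text as returned by the XML parser: the inner HTML entities
# (&#039;) are still escaped at this point.
raw_title = (
    "Photo of iceberg that is believed to have sunk Titanic sold at auction "
    "for £21,000 alongside &#039;world&#039;s most valuable biscuit&#039;"
)

# html.unescape converts named and numeric character references
# into the characters they stand for.
clean_title = html.unescape(raw_title)
print(clean_title)
```

The result is an ordinary text string, so it can be stored as-is or encoded to UTF-8 bytes afterwards if the storage layer needs bytes.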
I solved it by unescaping the title with HTMLParser and then encoding it as UTF-8. Marco's answer did essentially the same, but the html library didn't work for me.
I don't recommend using from_encoding='latin-1', as it causes other problems. Unescaping and then calling encode('utf-8') is enough to turn the £ into \xa3, which is the proper Unicode character.
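As a small illustration of why \xa3 is not a sign of corruption: it is just the escaped way Python displays the code point U+00A3, the pound sign. The sketch below assumes Python 3 and uses hypothetical variable names. It also shows that encoding to UTF-8 produces the two bytes \xc2\xa3:

```python
# "\xa3" and "£" are the same one-character string (code point U+00A3).
pound = "\xa3"
same_character = (pound == "£")

# ascii() shows the escaped form, like the one seen in the title above.
escaped_form = ascii(pound)        # '\xa3'

# Encoding to UTF-8 turns the single code point into two bytes.
utf8_bytes = pound.encode("utf-8")  # b'\xc2\xa3'

print(same_character, escaped_form, utf8_bytes)
```

So seeing \xa3 in a debug representation only means the text holds a real pound sign; it prints as £ once written to a terminal or page that uses a matching encoding.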