I'm trying to scrape an RSS feed with a news title like this:
<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &#039;world&#039;s most valuable biscuit&#039;</title>
This is effectively how I'm using Beautiful Soup to scrape it:
soup = BeautifulSoup(xml, 'xml')
start = soup.findAll('item')
for i in start:
    news, is_created = News.create_or_update(news_id,
        head_line=i.title.text.encode('utf-8').strip(),
        ...)
However, despite this effort, the title remains like this:
Photo of iceberg that is believed to have sunk Titanic sold at auction for \xa321,000 alongside 'world's most valuable biscuit'
Would it be easier just to convert these special characters into ASCII characters?
For the example you provide, this works for me:
from bs4 import BeautifulSoup
import html
xml='<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &#039;world&#039;s most valuable biscuit&#039;</title>'
soup = BeautifulSoup(xml, 'lxml')
print(html.unescape(soup.get_text()))
html.unescape handles the HTML entities. If Beautiful Soup is not handling the pound sign correctly, you may need to specify the encoding when creating the BeautifulSoup object, e.g.:
soup = BeautifulSoup(xml, "lxml", from_encoding='latin-1')
I believe I have finally found the problem. The characters above are escaped HTML inside XML. What a mess. If you look at the Independent's RSS feed, most titles are affected like this.
So this is not a UTF-8 problem. How can I decode the HTML entities in the title above before converting it to UTF-8?
head_line=i.title.text.encode('utf-8').strip(),
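To illustrate the diagnosis, here is a minimal Python 3 sketch using only the standard library. It assumes the raw feed escapes the apostrophe twice, as &amp;#039; (that is a guess about the raw feed, not something shown above), and the shortened sample title and variable names are made up for illustration.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical raw feed fragment (assumption): the apostrophe was escaped
# as HTML first (&#039;) and then escaped again for XML (&amp;#039;).
raw = "<title>sold for £21,000 alongside &amp;#039;biscuit&amp;#039;</title>"

# The XML parser removes only the outer, XML layer of escaping.
xml_text = ET.fromstring(raw).text

# html.unescape removes the remaining HTML layer.
clean = html.unescape(xml_text)

print(xml_text)
print(clean)
```

If the feed really is escaped this way, the first print still shows the &#039; entities and the second shows plain apostrophes, which would explain why parsing alone is not enough.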
I solved it by unescaping the title with HTMLParser and then encoding it as UTF-8. Marco's answer does essentially the same thing, but the html library didn't work for me.
head_line=HTMLParser.HTMLParser().unescape(i.title.text).encode('utf-8').strip(),
I don't recommend using from_encoding='latin-1', as it causes other problems. Unescaping and then calling encode('utf-8') is enough: the £ shows up as \xa3 (U+00A3), which is the proper Unicode character.
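As a rough Python 3 illustration of the \xa3 point: the HTMLParser call above appears to be Python 2 style, and html.unescape is the Python 3 standard-library counterpart. The sample title below is shortened and the variable names are hypothetical. The sketch shows that \xa3 is simply how the code point U+00A3 (£) is written in an escaped string, while the UTF-8 encoded bytes for it are \xc2\xa3.

```python
import html

# Shortened sample title with HTML-escaped apostrophes (illustrative only).
title = "auction for £21,000 alongside &#039;biscuit&#039;"

text = html.unescape(title)   # str: HTML entities decoded
data = text.encode("utf-8")   # bytes: the same text encoded as UTF-8

print(ascii(text))  # escaped view of the str, where the pound sign appears as \xa3
print(data)         # raw bytes, where the pound sign appears as \xc2\xa3
```

So seeing \xa3 in a repr does not by itself mean the text is broken; it may just be the escaped display of a correctly decoded £.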