I am using Python and Beautifulsoup to parse HTML-Data and get p-tags out of RSS-Feeds. However, some urls cause problems because the parsed soup-object does not include all nodes of the document.
For example I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm
But after comparing the parsed object with the pages source code, I noticed that all nodes after ul class="nextgen-left"
are missing.
Here is how I parse the Documents:
from bs4 import BeautifulSoup as bs
url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)
response = opener.open(request)
soup = bs(response,'lxml')
print soup
The input HTML is not quite conformant, so you'll have to use a different parser here. The
html5lib
parser handles this page correctly: