I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm. I want to get countries and names at the beginning of the document.
Here is my code:
import urllib
import re
from bs4 import BeautifulSoup
url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm"
soup=BeautifulSoup(urllib.urlopen(url))
attendances_table=soup.find("table", {"width":850})
print attendances_table #this works, I see the whole table
print attendances_table.find_all("tr")
I get the following error:
AttributeError: 'NoneType' object has no attribute 'next_element'
I then tried the same solution as in this post (I know, it's me again :p): beautifulsoup with an invalid html document
I replaced the line:
soup=BeautifulSoup(urllib.urlopen(url))
with:
return BeautifulSoup(html, 'html.parser')
Now if I do:
print attendances_table
I only get:
<table border="0" cellpadding="10" cellspacing="0" width="850">
<tr><td valign="TOP" width="42%">
<p><b><u>Belgium</u></b></p></td></tr></table>
What should I change?
Use html5lib as a parser; it is extremely lenient. You would also need to install the html5lib module first.
Demo:
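A minimal sketch of the parser swap, assuming html5lib is installed (pip install html5lib). The SAMPLE_HTML string and the parse_attendance_rows helper are made up for illustration: the two rows use placeholder names rather than data from the real page, and the sample is well-formed, so it only shows the call pattern and does not reproduce the original failure.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the attendance table; the names are placeholders.
SAMPLE_HTML = """
<table border="0" cellpadding="10" cellspacing="0" width="850">
<tr><td valign="TOP" width="42%"><p><b><u>Belgium</u></b></p></td>
<td valign="TOP" width="58%"><p>Minister A</p></td></tr>
<tr><td valign="TOP" width="42%"><p><b><u>Denmark</u></b></p></td>
<td valign="TOP" width="58%"><p>Minister B</p></td></tr>
</table>
"""


def parse_attendance_rows(html, parser="html5lib"):
    """Return the text of every cell in the width=850 table, one list per row."""
    # The second argument selects the parser; "html5lib" is the lenient one.
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", {"width": "850"})
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows
```

For the real page, you would pass the downloaded HTML instead of SAMPLE_HTML; the only change from the code in the question is the explicit parser argument.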
Workaround to make find_all('tr') work:

Solved!
I just used another parser library, lxml. Thank you Martijn Pieters for that! lxml was the only library that worked for me!
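Since the answers here disagree on which parser copes with this page, one way to check on your own machine is to run the same lookup with every parser you have installed and compare how many rows each one recovers. This is only a sketch: count_rows_per_parser and TWO_ROW_HTML are hypothetical names, the sample markup is a placeholder, and parsers that are not installed are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical two-row sample; replace it with the HTML of the real page.
TWO_ROW_HTML = """
<table width="850">
<tr><td>Belgium</td><td>Minister A</td></tr>
<tr><td>Denmark</td><td>Minister B</td></tr>
</table>
"""


def count_rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to the number of <tr> tags it finds."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not installed.
            continue
        table = soup.find("table", {"width": "850"})
        counts[name] = len(table.find_all("tr")) if table is not None else 0
    return counts
```

On a clean sample like this one every installed parser should agree; on the real document, the parser that reports the full row count is the one to keep.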