BeautifulSoup and an invalid HTML document

I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm. I want to get countries and names at the beginning of the document.

Here is my code:

import urllib
import re
from bs4 import BeautifulSoup
url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm"
soup=BeautifulSoup(urllib.urlopen(url))
attendances_table=soup.find("table", {"width":850})
print attendances_table #this works, I see the whole table
print attendances_table.find_all("tr")

I get the following error:

AttributeError: 'NoneType' object has no attribute 'next_element'

I then tried to use the same solution as in this post (I know, again, me :p) : beautifulsoup with an invalid html document

I replaced the line:

soup=BeautifulSoup(urllib.urlopen(url))

with:

soup=BeautifulSoup(urllib.urlopen(url), 'html.parser')

Now if I do:

print attendances_table

I only get:

<table border="0" cellpadding="10" cellspacing="0" width="850">
<tr><td valign="TOP" width="42%">
<p><b><u>Belgium</u></b></p></td></tr></table>

What should I change?

Tags: python html parsing html-parsing beautifulsoup

2 Answers

闹够了就滚

Reply #2 · 2019-05-05 20:00

Use html5lib as the parser; it is extremely lenient:

soup = BeautifulSoup(urllib.urlopen(url), 'html5lib')

You will need to install the html5lib module first.
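The snippets in this thread are Python 2 (urllib.urlopen, print statements). As a rough sketch of the same fetch on Python 3, assuming urllib.request.urlopen in place of urllib.urlopen: the helper name fetch_soup and its timeout default are my own, not part of BeautifulSoup.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, parser="html5lib", timeout=30):
    """Download `url` and parse it with the given BeautifulSoup parser.

    `fetch_soup` is a hypothetical helper; `parser` must name a parser
    that is installed (html5lib and lxml are separate packages, while
    'html.parser' ships with Python).
    """
    with urlopen(url, timeout=timeout) as response:
        # Hand the raw bytes to BeautifulSoup so it can detect the encoding.
        return BeautifulSoup(response.read(), parser)
```

Usage would look like soup = fetch_soup(url) followed by soup.find("table", {"width": "850"}), as in the demo.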

Demo:

>>> from bs4 import BeautifulSoup
>>> import urllib
>>> url = "http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm"
>>> soup = BeautifulSoup(urllib.urlopen(url), 'html5lib')
>>> attendances_table = soup.find("table", {"width": 850})
>>> print attendances_table
<table border="0" cellpadding="10" cellspacing="0" width="850">
<tbody><tr><td valign="TOP" width="42%">
<p><b><u>Belgium</u></b>:</p>
<p>Mr Philippe MAYSTADT</p></td>
<td valign="TOP" width="58%">
<p>Deputy Prime Minister, Minister for Finance and Foreign Trade</p></td>
</tr>
...
<tr><td valign="TOP" width="42%">
<b><u></u></b><u></u><p><u><b>Portugal</b></u>:</p>
<p>Mr António de SOUSA FRANCO</p>
<p>Mr Fernando TEIXEIRA dos SANTOS</p></td>
<td valign="TOP" width="58%">
<p>Minister for Finance</p>
<p>State Secretary for the Treasury and Finance</p></td>
</tr>
</tbody></table>

Workaround to make find_all('tr') work:

>>> attendances_table = BeautifulSoup(str(attendances_table), 'html5lib')
>>> print attendances_table.find_all("tr")
[<tr><td valign="TOP" width="42%">
<p><b><u>Belgium</u></b>:</p>
<p>Mr Philippe MAYSTADT</p></td>
...
<tr><td valign="TOP" width="42%">
<b><u></u></b><u></u><p><u><b>Portugal</b></u>:</p>
<p>Mr AntÃ³nio de SOUSA FRANCO</p>
<p>Mr Fernando TEIXEIRA dos SANTOS</p></td>
<td valign="TOP" width="58%">
<p>Minister for Finance</p>
<p>State Secretary for the Treasury and Finance</p></td>
</tr>]
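Once find_all("tr") returns rows, the countries and names the question asks for can be pulled out of each row. The following is only a sketch in Python 3, run against a small inline sample shaped like the output above rather than the live page. It assumes the first <p> in the left cell is the country, the remaining <p> elements are names, and the right cell lists titles in the same order. The function name parse_attendance is made up for this example.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the table printed above (simplified, well-formed).
SAMPLE = """
<table border="0" cellpadding="10" cellspacing="0" width="850">
<tr><td valign="TOP" width="42%">
<p><b><u>Belgium</u></b>:</p>
<p>Mr Philippe MAYSTADT</p></td>
<td valign="TOP" width="58%">
<p>Deputy Prime Minister, Minister for Finance and Foreign Trade</p></td>
</tr>
<tr><td valign="TOP" width="42%">
<p><u><b>Portugal</b></u>:</p>
<p>Mr António de SOUSA FRANCO</p>
<p>Mr Fernando TEIXEIRA dos SANTOS</p></td>
<td valign="TOP" width="58%">
<p>Minister for Finance</p>
<p>State Secretary for the Treasury and Finance</p></td>
</tr>
</table>
"""


def cell_paragraphs(cell):
    """Return the non-empty text of every <p> inside a table cell."""
    texts = (p.get_text(strip=True) for p in cell.find_all("p"))
    return [text for text in texts if text]


def parse_attendance(table):
    """Map each country to a list of (name, title) pairs.

    Assumption: names and titles line up by position within a row.
    """
    attendance = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # skip rows that do not have both columns
        left = cell_paragraphs(cells[0])
        right = cell_paragraphs(cells[1])
        if not left:
            continue
        country = left[0].rstrip(":")  # "Belgium:" -> "Belgium"
        attendance[country] = list(zip(left[1:], right))
    return attendance


soup = BeautifulSoup(SAMPLE, "html.parser")
attendance = parse_attendance(soup.find("table", {"width": "850"}))
for country, people in attendance.items():
    print(country, people)
```

On the real document you would pass the table obtained with html5lib (or lxml) instead of the sample. Rows where the counts of names and titles differ would need extra handling, since zip silently drops the unmatched entries.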


傲

Reply #3 · 2019-05-05 20:14

Solved!

I just used another parser library, lxml. Thank you Martijn Pieters for that!

soup = BeautifulSoup(urllib.urlopen(url), 'lxml')

lxml was the only library that worked for me!
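Since the parsers repair broken markup differently, and which ones are installed varies from machine to machine, it can help to try each one on the same input and compare. Here is a small Python 3 sketch of that idea. The BROKEN fragment is an invented example of unclosed tags, not taken from the page, and count_rows is a hypothetical helper. FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invented fragment with unclosed <tr>, <td> and <p> tags.
BROKEN = '<table width="850"><tr><td><p><b><u>Belgium</u></b>:<td><p>Minister</table>'


def count_rows(markup, parsers=("html.parser", "html5lib", "lxml")):
    """Return {parser name: number of <tr> found, or None if not installed}."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            counts[parser] = None  # this parser library is missing
            continue
        counts[parser] = len(soup.find_all("tr"))
    return counts


counts = count_rows(BROKEN)
for parser, found in counts.items():
    print(parser, found)
```

Running the same comparison on the downloaded page shows quickly which parser gives a usable table on your setup.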
