BeautifulSoup with an invalid HTML document

I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm. I want to extract everything before "Commission:".

(I need BeautifulSoup because the second step is to extract countries and person names.)

If I do:

import urllib
import re
from bs4 import BeautifulSoup
url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
soup=BeautifulSoup(urllib.urlopen(url))
print soup.find_all(text=re.compile("Commission"))

The only result I get is:

[u'The Governments of the Member States and the European Commission were represented as follows:']

This is the first occurrence of the word, but not the line I am looking for. I think it's because the document is not valid HTML, but I am not sure. If I look at the source code, I see:

<B><U><P>Commission</B></U>:</P>

But if I do a print of soup, I can see the text, with tags reordered:

<u><b>Commission</b></u>

How can I get this element "Commission:"?

I am using Python 2.7 and BeautifulSoup 4.3.2.

EDIT: SOLVED!

As suggested by alecxe, I replaced the line:

soup=BeautifulSoup(urllib.urlopen(url))

with

soup=BeautifulSoup(urllib.urlopen(url), 'html.parser')

It works now :). Thanks to everyone.
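For readers on Python 3, here is a minimal sketch of the same fix. It assumes a recent `bs4` release and parses an inline string instead of fetching the page; the second line of the sample is the mis-nested markup quoted above, while the surrounding `<p>` wrapper is an assumption made for illustration. When no parser is named, BeautifulSoup picks the "best" one installed, so results can differ between machines; naming `'html.parser'` explicitly removes that variable. Note that `string=` is the newer spelling of the `text=` argument used in the question.

```python
import re
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (an assumption for this sketch).
# The second line is the mis-nested markup quoted in the question.
html = (
    "<p>The Governments of the Member States and the European Commission "
    "were represented as follows:</p>\n"
    "<B><U><P>Commission</B></U>:</P>\n"
)

# Name the parser explicitly instead of relying on whichever one is installed.
soup = BeautifulSoup(html, "html.parser")

# Collect every text node that contains the word "Commission".
matches = soup.find_all(string=re.compile("Commission"))
for text in matches:
    print(repr(str(text)))
```

With the built-in parser, the standalone "Commission" text node should show up in `matches` next to the sentence that mentions the European Commission.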

EDIT: similar problems

I have found similar problems with the same solution:

Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds

Beautiful Soup findAll doesn't find them all

Tags: python html parsing html-parsing beautifulsoup

2 answers

做自己的国王

Reply #2 · 2019-08-28 15:44

If you want everything before the tag containing "Commission:", you could do it without BeautifulSoup: treat the page as a plain string, search for the right keyword, and drop the rest of the string.

But when I run your code I get the following:
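A minimal sketch of that string-based idea, in Python 3. The helper name `text_before` and the sample string are hypothetical. Because the word "Commission" also appears earlier in the page, the sketch searches for the tag sequence quoted in the question rather than for the bare word.

```python
def text_before(document, marker):
    """Return the part of `document` that precedes the first `marker`.

    If the marker is absent, the whole document is returned unchanged.
    """
    index = document.find(marker)
    if index == -1:
        return document
    return document[:index]


# Hypothetical stand-in for the raw page source.
raw = (
    "<p>The Governments of the Member States and the European Commission "
    "were represented as follows:</p>\n"
    "<p>Belgium:</p>\n"
    "<B><U><P>Commission</B></U>:</P>\n"
    "<p>placeholder text after the marker</p>\n"
)

# Use the markup from the page source as the marker, not the bare word,
# so the earlier "European Commission" sentence is not matched.
head = text_before(raw, "<B><U><P>Commission")
print(head)
```

The result is still raw HTML, so it can be handed to BeautifulSoup afterwards for the second step (extracting countries and names).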

[u'The Governments of the Member States and the European Commission were represented as follows:',
 u'Commission',
 u'The Council held an orientation debate on key economic policy issues with a view to giving guidance to the Commission on the questions Ministers wish to be addressed in the broad economic policy guidelines 1998/99 for which the Commission will present its recommandation later in the Spring. It was noted that the forthcoming guidelines are of particular importance given the start of stage 3 of EMU.',
 u'The debate was based on an assessment of the economic situation and outlook in the Community carried out by the Commission and the Economic Policy and Monetary Committees.',
 u"The Council held an orientation debate on the Commission's Communication setting out a possible Community framework allowing Member States to experiment with reduced VAT rates for labour-intensive services in order to boost employment in small businesses without distorting international competition. ",
 u'This Communication was tabled by the Commission as a follow-up to the Employment European Council of last November in Luxembourg, which concluded that, in order to make the taxation system more employment-friendly, "Member States will examine, without obligation, the advisability of reducing the rate of VAT on labour-intensive services not exposed to cross-border competition".',
 u"In conclusion, the Council invited Coreper to examine the technical questions arising from today's debate and to report back to it with a view to deciding on a possible request to the Commission to submit a proposal in this area. ",
 u"This technical examination should be carried out, taking into account the criteria indicated in the Commission's Communication for a reduced VAT rate, on the following questions :",
 u'An initial trial period running until the year 2002 should identify the best method for allocating FISIM. At the end of this period, the Commission will assess the results of the trial period and decide, by means of a comitology procedure, on the final methodology to be applied. However, a unanimous decision by the Council would be needed in order to use the new methodology in budgetary calculations on other Community policies and notably concerning "own resources".']


萌系小妹纸

Reply #3 · 2019-08-28 15:54

Iterate over the p elements and stop when you find one whose text starts with Commission:

import urllib
from bs4 import BeautifulSoup

url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
soup=BeautifulSoup(urllib.urlopen(url))

for item in soup.find_all('p'):
    if item.text.startswith('Commission'):
        break
    else:
        print item.text

It prints everything before Commission:

The Governments of the Member States and the European Commission were represented as follows:
Belgium:
...
Ms Helen LIDDELL            Economic Secretary to the Treasury
* * *
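The same loop can be sketched for Python 3 as a small function that returns the paragraphs instead of printing them. This assumes a recent `bs4`, uses an inline sample rather than the live page, and the helper name `paragraphs_before` and the placeholder line after the marker are hypothetical. The parser is named explicitly so the result does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page (an assumption for this sketch). The lines
# before the marker come from the output above; the last one is a placeholder.
html = (
    "<p>The Governments of the Member States and the European Commission "
    "were represented as follows:</p>\n"
    "<p>Belgium:</p>\n"
    "<p>Ms Helen LIDDELL            Economic Secretary to the Treasury</p>\n"
    "<B><U><P>Commission</B></U>:</P>\n"
    "<p>placeholder text after the marker</p>\n"
)


def paragraphs_before(soup, marker):
    """Collect the text of each <p> until one starts with `marker`."""
    collected = []
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text.startswith(marker):
            break
        collected.append(text)
    return collected


soup = BeautifulSoup(html, "html.parser")
before = paragraphs_before(soup, "Commission")
for line in before:
    print(line)
```

Returning a list keeps the paragraphs available for the follow-up step of extracting countries and person names.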
