I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm.
I want to extract everything before "Commission:".
(I need BeautifulSoup because the second step is to extract countries and person names.)
If I do:
import urllib
import re
from bs4 import BeautifulSoup
url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
soup=BeautifulSoup(urllib.urlopen(url))
print soup.find_all(text=re.compile("Commission"))
The only result I get is:
[u'The Governments of the Member States and the European Commission were represented as follows:']
This is the first occurrence of the word, but not the line I am looking for. I think it's because the document is not valid HTML, but I am not sure. If I look at the source code:
<B><U><P>Commission</B></U>:</P>
But if I print soup, I can see the text, with the tags reordered:
<u><b>Commission</b></u>
How can I get this element "Commission:"?
I am using Python 2.7 and BeautifulSoup 4.3.2.
EDIT: SOLVED!
As alecxe suggested, I replaced the line:
soup=BeautifulSoup(urllib.urlopen(url))
with
soup=BeautifulSoup(urllib.urlopen(url), 'html.parser')
It works now :). Thanks to everyone.
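For anyone landing here later, a minimal sketch of the fix on an inline snippet. When no parser is named, BeautifulSoup picks the best one installed, so results can differ between machines; naming 'html.parser' pins the behaviour. The sample string is an assumption built from the two lines quoted above, not the real page, and the variable names are illustrative. The code is written to run on both Python 2.7 and Python 3.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample: the sentence from the first match plus the malformed
# heading quoted from the page source.
sample_html = (
    "<P>The Governments of the Member States and the European "
    "Commission were represented as follows:</P>\n"
    "<B><U><P>Commission</B></U>:</P>"
)

# Name the parser explicitly instead of relying on the default choice.
soup = BeautifulSoup(sample_html, "html.parser")

# Every text node that mentions the word.
# Note: on bs4 older than 4.4 the keyword is text= instead of string=.
all_matches = soup.find_all(string=re.compile("Commission"))

# Only text nodes that start with the word, i.e. the heading itself.
headings = soup.find_all(string=re.compile(r"^\s*Commission"))

print(len(all_matches))
print([str(h) for h in headings])
```

With html.parser the colon ends up in a separate text node after the heading, so matching on the bare word is more robust than matching on "Commission:".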
EDIT: similar problems
I have found similar problems with the same solution:
Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds
If you want everything before the tag with the "Commission:" value, you could do it without BeautifulSoup: treat the document as a string, search for the right keyword, and drop the rest of the string.
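A rough sketch of that string-only idea, using just the standard library. In the page source the colon sits after the closing tags, so a plain search for "Commission:" would miss it; the regular expression below allows closing tags between the word and the colon. The sample markup, the MARKER pattern and the function name are assumptions for illustration, and the person names are made up.

```python
import re

# Assumed sample mimicking the markup quoted in the question.
raw_html = (
    "<P>The Governments of the Member States and the European "
    "Commission were represented as follows:</P>\n"
    "<P>Belgium: Mr Example</P>\n"
    "<B><U><P>Commission</B></U>:</P>\n"
    "<P>Mr Other</P>"
)

# The word, then optional closing tags and whitespace, then the colon.
MARKER = re.compile(r"Commission\s*(?:</[^>]+>\s*)*:")


def text_before_marker(html):
    """Return the part of html that precedes the marker, or all of it."""
    match = MARKER.search(html)
    if match is None:
        return html
    return html[:match.start()]


before = text_before_marker(raw_html)
print(before)
```

The result still contains the opening tags that preceded the heading, so it probably needs a parser afterwards if you want clean country and person names.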
But when I run your code I get the following:
Iterate over p elements and stop when you find a text starting with Commission:
It prints everything before Commission:
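The loop described in this answer can be sketched as follows. This is a reconstruction under assumptions, not the answerer's original code: the sample markup is a tidied stand-in for the real page, the names in it are made up, and paragraphs_before is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Assumed, well-formed stand-in for the relevant part of the page.
sample_html = """
<p>Belgium: Mr Example One</p>
<p>Denmark: Ms Example Two</p>
<p><b><u>Commission</u></b>:</p>
<p>Mr Example Three</p>
"""


def paragraphs_before(html, marker="Commission"):
    """Collect the text of each <p> until one starts with marker."""
    soup = BeautifulSoup(html, "html.parser")
    collected = []
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        if text.startswith(marker):
            break  # stop at the heading; later paragraphs are skipped
        collected.append(text)
    return collected


before_commission = paragraphs_before(sample_html)
for line in before_commission:
    print(line)
```

On the real page you would pass the downloaded document instead of sample_html; how the malformed heading is nested there depends on the parser, so the startswith check may need adjusting.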