I want to parse content from a website; for example, http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I want to parse the ingredients into a text file. The ingredients are located in:
<div class="ingredients" style="margin-top: 10px;">
and within this div, each ingredient sits inside:
<li class="plaincharacterwrap">
Someone was nice enough to provide code using regex, but it gets confusing when you are modifying it from site to site. So I wanted to use Beautiful Soup, since it has a lot of built-in features, except I am confused about how to actually do it.
Code:
import urllib2
from BeautifulSoup import BeautifulSoup, NavigableString

html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
soup = BeautifulSoup(html)
try:
    ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
except IOError:
    print 'IO error'
Is this kind of how you get started? I want to find the div by its class and then parse out all the ingredients located within those li elements.
Any help would be appreciated! Thanks!
import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
results in
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
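(A note on the try/except in the question: soup.find won't raise IOError; it is the network fetch that can fail. A minimal sketch of where the handler belongs, with a made-up error message:)

import sys
import urllib2
import BeautifulSoup

url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
try:
    # the fetch is the step that can raise IOError (urllib2.URLError
    # is a subclass of IOError in Python 2)
    data = urllib2.urlopen(url).read()
except IOError, e:
    print >> sys.stderr, 'could not fetch %s: %s' % (url, e)
    sys.exit(1)
bs = BeautifulSoup.BeautifulSoup(data)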
Follow-up response to @eyquem:
from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html
start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"
# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"
# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s - same =", (res2==res1)
# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s - same =", (res3==res1)
gives
Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s - same = True
lxml parse took 0.0100940499505 s - same = True
Regex is much faster (except when it's wrong), but if you count loading the page and parsing it together, BeautifulSoup's parse is still only about 20% of the total runtime (0.28 s out of 1.37 s). If you are terribly concerned about speed, I recommend lxml instead.
Yes, a special regex pattern must be written for every site.
But I think that:
1- the treatment done with Beautiful Soup must be adapted to every site, too;
2- regexes are not so complicated to write, and with a little habit it can be done quickly.
I am curious to see what kind of treatment must be done with Beautiful Soup to obtain the same result that I obtained in a few minutes. Once upon a time I tried to learn Beautiful Soup, but I didn't understand anything of that mess. I should try again, now that I am a little more skilled in Python. But regexes have been OK and sufficient for me until now.
Here's the code for this new site:
import urllib
import re
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
sock = urllib.urlopen(url)
ch = sock.read()
sock.close()
x = ch.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
print '\n'.join(patingr.findall(ch,x))
EDIT
I downloaded and installed BeautifulSoup and ran a comparison with regex.
I don't think I made any mistakes in my comparison code:
import urllib
import re
from time import clock
import BeautifulSoup
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te
te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te
print res1
print
print res2
print
print 'res1==res2 is ',res1==res2
print '\nRegex:', t1
print '\nBeautifulSoup:', t2
print '\nBeautifulSoup execution time / Regex execution time ==', t2/t1
result
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
res1==res2 is True
Regex: 0.00210892725193
BeautifulSoup: 2.32453566026
BeautifulSoup execution time / Regex execution time == 1102.23605776
No comment!
EDIT 2
I realized that in my code I don't use a bare regex; I use a method that combines a regex with find().
It's the method I use when I resort to regexes, because in some cases it raises the speed of processing: the function find() runs extremely rapidly.
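To illustrate the mechanism on a toy string (not the recipe page): str.find() locates a landmark at C speed, and passing its index as the pos argument of findall() makes the regex start scanning there instead of at the beginning of the string.

import re

s = 'noise target noise START target end'
pat = re.compile('target')
x = s.find('START')      # fast C-level scan for a landmark
print pat.findall(s)     # whole string scanned    -> ['target', 'target']
print pat.findall(s, x)  # scan starts at offset x -> ['target']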
To know what we are comparing, we need the following codes.
In codes 3 and 4, I took account of the remarks of Achim in another thread of posts: using re.IGNORECASE and re.DOTALL, and ["\'] instead of ".
These codes are separated because they must be executed in different files to obtain reliable results: I don't know why, but if all the codes are executed in the same file, certain resulting times are strongly different (0.00075 instead of 0.0022, for example).
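(As an aside, a more reliable way to time snippets like these is the standard timeit module, which reruns a statement many times and so is less sensitive to one-off effects such as re's internal pattern cache; a minimal sketch, assuming data and patingr are defined at module level:)

import timeit

t = timeit.Timer('patingr.findall(data)',
                 'from __main__ import patingr, data')
# best average over 3 repeats of 100 runs each
print min(t.repeat(3, 100)) / 100

The comparisons below nevertheless use clock(), as in the earlier posts.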
import urllib
import re
import BeautifulSoup
from time import clock
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
# Simple regex, without x
te = clock()
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res0 = '\n'.join(patingr.findall(data))
t0 = clock()-te
print '\nSimple regex, without x:', t0
and
# Simple regex, with x
te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data, x))
t1 = clock()-te
print '\nSimple regex, with x:', t1
and
# Regex with flags, without x and y
te = clock()
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res10 = '\n'.join(patingr.findall(data))
t10 = clock()-te
print '\nRegex with flags, without x and y:', t10
and
# Regex with flags, with x and y
te = clock()
x = data.find('Ingredients</h3>')
y = data.find('h3>\r\n Footnotes</h3>\r\n')
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res11 = '\n'.join(patingr.findall(data, x, y))
t11 = clock()-te
print '\nRegex with flags, with x and y:', t11
and
# BeautifulSoup
te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te
print '\nBeautifulSoup:', t2
result
Simple regex, without x: 0.00230488284125
Simple regex, with x: 0.00229121279385
Regex with flags, without x and y: 0.00758719458758
Regex with flags, with x and y: 0.00183724493364
BeautifulSoup: 2.58728860791
The use of x has no influence on the speed of the simple regex.
The regex with flags, without x and y, takes longer to execute, but its result isn't the same as the others', because it catches a supplementary chunk of text. That's why, in a real application, it is the regex with flags and x/y that should be used.
The more complicated regex with flags and with x and y takes about 20% less time.
Well, the results don't change very much with or without x/y, so my conclusion is the same: the use of a regex, resorting to find() or not, remains roughly 1000 times faster than BeautifulSoup, and I estimate 100 times faster than lxml (I haven't installed lxml).
To what you wrote, Hugh, I would say:
When a regex is wrong, it is neither faster nor slower: it doesn't run.
When a regex is wrong, the coder makes it right, that's all.
I don't understand why 95% of the people on stackoverflow.com want to persuade the other 5% that regexes must not be employed to analyse HTML or XML or anything else. I say "analyse", not "parse". As far as I understand it, a parser first analyses the WHOLE of a text and then displays the content of whatever elements we want. On the contrary, a regex goes straight to what is searched for; it doesn't build the tree of the HTML/XML text or do whatever else a parser does that I don't know very well.
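A toy illustration of that difference (hypothetical one-line document, BeautifulSoup 3 API as used above):

import re
import BeautifulSoup

doc = '<html><body><p class="x">hello</p><p>other</p></body></html>'

# the regex scans straight for the target substring...
print re.search(r'<p class="x">(.*?)</p>', doc).group(1)  # hello

# ...whereas BeautifulSoup first builds a tree of the WHOLE document,
# then navigates it
tree = BeautifulSoup.BeautifulSoup(doc)
print tree.find('p', {'class': 'x'}).string               # hello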
So, I am very satisfied with regexes. I have no problem writing even very long REs, and regexes allow me to run programs that must react rapidly after analysing a text. BS or lxml would work, but that would be a hassle.
I would have other comments to make, but I have no time for a subject in which, in fact, I let others do as they prefer.