I am running Python 2.7.5 and using the built-in html parser for what I am about to describe.
The task I am trying to accomplish is to take a chunk of html that is essentially a recipe. Here is an example.
html_chunk = "<h1>Miniature Potato Knishes</h1><p>Posted by bettyboop50 at recipegoldmine.com May 10, 2001</p><p>Makes about 42 miniature knishes</p><p>These are just yummy for your tummy!</p><p>3 cups mashed potatoes (about<br> 2 very large potatoes)<br>2 eggs, slightly beaten<br>1 large onion, diced<br>2 tablespoons margarine<br>1 teaspoon salt (or to taste)<br>1/8 teaspoon black pepper<br>3/8 cup Matzoh meal<br>1 egg yolk, beaten with 1 tablespoon water</p><p>Preheat oven to 400 degrees F.</p><p>Sauté diced onion in a small amount of butter or margarine until golden brown.</p><p>In medium bowl, combine mashed potatoes, sautéed onion, eggs, margarine, salt, pepper, and Matzoh meal.</p><p>Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture and place on a well-greased baking sheet and bake for 20 minutes or until well browned.</p>"
The goal is to separate out the header, junk, ingredients, instructions, serving, and number of ingredients.
Here is my code that accomplishes that
from bs4 import BeautifulSoup
def list_to_string(list):
joined = ""
for item in list:
joined += str(item)
return joined
def get_ingredients(soup):
for p in soup.find_all('p'):
if p.find('br'):
return p
def get_instructions(p_list, ingredient_index):
instructions = []
instructions += p_list[ingredient_index+1:]
return instructions
def get_junk(p_list, ingredient_index):
junk = []
junk += p_list[:ingredient_index]
return junk
def get_serving(p_list):
for item in p_list:
item_str = str(item).lower()
if ("yield" or "make" or "serve" or "serving") in item_str:
yield_index = p_list.index(item)
del p_list[yield_index]
return item
def ingredients_count(ingredients):
ingredients_list = ingredients.find_all(text=True)
return len(ingredients_list)
def get_header(soup):
return soup.find('h1')
def html_chunk_splitter(soup):
ingredients = get_ingredients(soup)
if ingredients == None:
error = 1
header = ""
junk_string = ""
instructions_string = ""
serving = ""
count = ""
else:
p_list = soup.find_all('p')
serving = get_serving(p_list)
ingredient_index = p_list.index(ingredients)
junk_list = get_junk(p_list, ingredient_index)
instructions_list = get_instructions(p_list, ingredient_index)
junk_string = list_to_string(junk_list)
instructions_string = list_to_string(instructions_list)
header = get_header(soup)
error = ""
count = ingredients_count(ingredients)
return (header, junk_string, ingredients, instructions_string,
serving, count, error)
It works well except in situations where I have chunks that contain strings like "Sauté"
because soup = BeautifulSoup(html_chunk)
causes Sauté to turn into Sauté and this is a problem because I have a huge csv file of recipes like the html_chunk and I'm trying to structure all of them nicely and then get the output back into a database. I tried checking it Sauté comes out right using this html previewer and it still comes out as Sauté. I don't know what to do about this.
What's stranger is that when I do what BeautifulSoup's documentation shows
BeautifulSoup("Sacré bleu!")
# <html><head></head><body>Sacré bleu!</body></html>
I get
# Sacré bleu!
But my colleague tried that on his Mac, running from terminal, and he got exactly what the documentation shows.
I really appreciate all your help. Thank you.
BeautifulSoup tries to guess the encoding, sometimes it makes a mistake, however you can specify the encoding by adding the
from_encoding
parameter: for exampleThe encoding is usually available in the header of the webpage
This is not a parsing problem; it is about encoding, rather.
Whenever working with text which might contain non-ASCII characters (or in Python programs which contain such characters, e.g. in comments or docstrings), you should put a coding cookie in the first or - after the shebang line - second line:
... and make sure this matches your file encoding (with vim:
:set fenc=utf-8
).