BeautifulSoup in Python not parsing right

Posted 2019-08-05 14:23

I am running Python 2.7.5 and using the built-in html parser for what I am about to describe.

The task I am trying to accomplish is to take a chunk of HTML that is essentially a recipe. Here is an example:

html_chunk = "<h1>Miniature Potato Knishes</h1><p>Posted by bettyboop50 at recipegoldmine.com May 10, 2001</p><p>Makes about 42 miniature knishes</p><p>These are just yummy for your tummy!</p><p>3 cups mashed potatoes (about<br>&nbsp;&nbsp;&nbsp; 2 very large potatoes)<br>2 eggs, slightly beaten<br>1 large onion, diced<br>2 tablespoons margarine<br>1 teaspoon salt (or to taste)<br>1/8 teaspoon black pepper<br>3/8 cup Matzoh meal<br>1 egg yolk, beaten with 1 tablespoon water</p><p>Preheat oven to 400 degrees F.</p><p>Sauté diced onion in a small amount of butter or margarine until golden brown.</p><p>In medium bowl, combine mashed potatoes, sautéed onion, eggs, margarine, salt, pepper, and Matzoh meal.</p><p>Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture and place on a well-greased baking sheet and bake for 20 minutes or until well browned.</p>"

The goal is to separate out the header, junk, ingredients, instructions, serving, and number of ingredients.

Here is my code that accomplishes that:

from bs4 import BeautifulSoup

def list_to_string(items):
    # concatenate the string form of each tag
    # (parameter renamed from "list", which shadows the builtin)
    joined = ""
    for item in items:
        joined += str(item)
    return joined

def get_ingredients(soup):
    # the ingredients paragraph is the one that contains <br> tags
    for p in soup.find_all('p'):
        if p.find('br'):
            return p

def get_instructions(p_list, ingredient_index):
    # everything after the ingredients paragraph is instructions
    return p_list[ingredient_index + 1:]

def get_junk(p_list, ingredient_index):
    # everything before the ingredients paragraph is junk
    return p_list[:ingredient_index]

def get_serving(p_list):
    # find and remove the paragraph that states the yield/servings;
    # any() is needed here, since ("yield" or "make" or ...) would
    # only ever test for "yield"
    for item in p_list:
        item_str = str(item).lower()
        if any(word in item_str for word in ("yield", "make", "serve", "serving")):
            yield_index = p_list.index(item)
            del p_list[yield_index]
            return item

def ingredients_count(ingredients):
    # count the text nodes between the <br> tags
    ingredients_list = ingredients.find_all(text=True)
    return len(ingredients_list)

def get_header(soup):
    return soup.find('h1')

def html_chunk_splitter(soup):
    ingredients = get_ingredients(soup)
    if ingredients is None:
        error = 1
        header = ""
        junk_string = ""
        instructions_string = ""
        serving = ""
        count = ""
    else:
        p_list = soup.find_all('p')
        serving = get_serving(p_list)  # note: removes the serving line from p_list
        ingredient_index = p_list.index(ingredients)
        junk_list = get_junk(p_list, ingredient_index)
        instructions_list = get_instructions(p_list, ingredient_index)
        junk_string = list_to_string(junk_list)
        instructions_string = list_to_string(instructions_list)
        header = get_header(soup)
        error = ""
        count = ingredients_count(ingredients)
    return (header, junk_string, ingredients, instructions_string,
            serving, count, error)
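
For context, I build the soup and call the splitter like this:

soup = BeautifulSoup(html_chunk)
(header, junk, ingredients, instructions,
 serving, count, error) = html_chunk_splitter(soup)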

It works well except in situations where I have chunks that contain strings like "Sauté", because soup = BeautifulSoup(html_chunk) turns Sauté into SautÃ©. This is a problem because I have a huge CSV file of recipes like html_chunk, and I'm trying to structure all of them nicely and then get the output back into a database. I tried checking whether SautÃ© comes out right using an HTML previewer, and it still comes out as SautÃ©. I don't know what to do about this.

What's stranger is that when I do what BeautifulSoup's documentation shows:

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

I get

# SacrÃ© bleu!

But my colleague tried that on his Mac, running from terminal, and he got exactly what the documentation shows.

I really appreciate all your help. Thank you.

2 Answers
你好瞎i · 2019-08-05 15:04

BeautifulSoup tries to guess the encoding, and sometimes it guesses wrong. You can specify the encoding explicitly by passing the from_encoding parameter, for example:

soup = BeautifulSoup(html_text, from_encoding="UTF-8")

The correct encoding is usually declared in the page's HTTP headers or in a <meta> tag.
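
For the setup in the question (Python 2 byte strings read from a CSV), a minimal sketch, assuming the chunks really are UTF-8:

from bs4 import BeautifulSoup

# html_chunk is a byte string pulled from the CSV; from_encoding tells
# BeautifulSoup how to decode it instead of letting it guess
soup = BeautifulSoup(html_chunk, "html.parser", from_encoding="utf-8")
print soup.find('h1')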

你好瞎i · 2019-08-05 15:19

This is not a parsing problem; it is an encoding problem.
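
The garbled text is the classic symptom of UTF-8 bytes being decoded as Latin-1 (or cp1252) somewhere along the way; you can reproduce it directly:

# -*- coding: utf-8 -*-
# u"Sauté" encoded as UTF-8 is 'Saut\xc3\xa9'; decoding those two bytes
# as Latin-1 turns them into the characters 'Ã' and '©'
print u"Sauté".encode("utf-8").decode("latin-1")  # SautÃ©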

Whenever you work with text that might contain non-ASCII characters (or write Python source that contains such characters, e.g. in comments or docstrings), you should put a coding cookie in the first line, or in the second line after a shebang:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

... and make sure this declaration matches the actual file encoding (in Vim: :set fenc=utf-8).
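
The same goes for data read at runtime: decode the byte strings to unicode yourself before handing them to BeautifulSoup, so it never has to guess. A minimal sketch, assuming the recipe chunks are UTF-8 (parse_chunk is just an illustrative helper):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

def parse_chunk(chunk):
    # chunk is a byte string, e.g. one cell from the recipe CSV;
    # decoding it up front removes the encoding guesswork
    return BeautifulSoup(chunk.decode("utf-8"))

soup = parse_chunk("<p>Saut\xc3\xa9 diced onion</p>")
print soup.p.get_text()  # Sauté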
