BeautifulSoup in Python not parsing correctly

I'm running Python 2.7.5 and using the built-in HTML parser for what I'm about to describe.

The task I'm trying to accomplish is to take a chunk of HTML that is essentially a recipe. Here is an example.

html_chunk = "<h1>Miniature Potato Knishes</h1>Posted by bettyboop50 at recipegoldmine.com May 10, 2001Makes about 42 miniature knishesThese are just yummy for your tummy!3 cups mashed potatoes (about     2 very large potatoes) 2 eggs, slightly beaten 1 large onion, diced 2 tablespoons margarine 1 teaspoon salt (or to taste) 1/8 teaspoon black pepper 3/8 cup Matzoh meal 1 egg yolk, beaten with 1 tablespoon waterPreheat oven to 400 degrees F.Sauté diced onion in a small amount of butter or margarine until golden brown.In medium bowl, combine mashed potatoes, sautéed onion, eggs, margarine, salt, pepper, and Matzoh meal.Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture and place on a well-greased baking sheet and bake for 20 minutes or until well browned."

The goal is to separate out the header, the junk, the ingredients, the instructions, the serving, and the number of ingredients.

Here is my code that accomplishes this:

from bs4 import BeautifulSoup

def list_to_string(items):
   # Concatenate the string form of every item (avoids shadowing the built-in `list`).
   return "".join(str(item) for item in items)

def get_ingredients(soup):
   for p in soup.find_all('p'):
      if p.find('br'):
         return p

def get_instructions(p_list, ingredient_index):
   instructions = []
   instructions += p_list[ingredient_index+1:]
   return instructions

def get_junk(p_list, ingredient_index):
   junk = []
   junk += p_list[:ingredient_index]
   return junk

def get_serving(p_list):
   for item in p_list:
      item_str = str(item).lower()
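      # Check each keyword separately; ("yield" or "make" or ...) evaluates to just "yield".
      if any(word in item_str for word in ("yield", "make", "serve", "serving")):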
         yield_index = p_list.index(item)
         del p_list[yield_index]
         return item

def ingredients_count(ingredients):
   ingredients_list = ingredients.find_all(text=True)
   return len(ingredients_list)

def get_header(soup):
   return soup.find('h1')

def html_chunk_splitter(soup):
   ingredients = get_ingredients(soup)
   if ingredients is None:
      error = 1
      header = ""
      junk_string = ""
      instructions_string = ""
      serving = ""
      count = ""
   else:
      p_list = soup.find_all('p')
      serving = get_serving(p_list)
      ingredient_index = p_list.index(ingredients)
      junk_list = get_junk(p_list, ingredient_index)
      instructions_list = get_instructions(p_list, ingredient_index)
      junk_string = list_to_string(junk_list)
      instructions_string = list_to_string(instructions_list)
      header = get_header(soup)
      error = ""
      count = ingredients_count(ingredients)
   return (header, junk_string, ingredients, instructions_string, 
   serving, count, error)

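It works well except in cases where a chunk contains a string like "Sauté", because soup = BeautifulSoup(html_chunk) causes Sauté to turn into SautÃ©. This is a problem because I have a huge CSV file of recipes like html_chunk, and I'm trying to structure all of them nicely and then get the output back into a database. I tried checking whether SautÃ© comes out right by using this HTML previewer, and it still comes out as SautÃ©. I don't know what to do about this.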

What's stranger is that when I do what BeautifulSoup's documentation shows:

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

I get

# Sacr├⌐ bleu!

But my colleague tried it on his Mac, running from the terminal, and he got exactly what the documentation shows.

I would really appreciate your help. Thank you.

Answer 1:

This is not a parsing problem; it is an encoding problem.

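Whenever you work with text that might contain non-ASCII characters (or have a Python program that contains such characters, for example in comments or docstrings), you should put a coding cookie in the first line or, if there is a shebang line, in the second line: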

#!/usr/bin/env python
# -*- coding: utf-8 -*-

... and make sure the file's actual encoding matches it (in vim: :set fenc=utf-8).
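To see why this is an encoding issue rather than a parsing one, the garbled forms from the question can be reproduced without BeautifulSoup at all. The following is a minimal sketch, assuming the text started out as UTF-8 bytes and was then decoded with a different codec; the choice of latin-1 and cp437 as the "wrong" codecs is an assumption that happens to match the output shown in the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Written with an escape so the example does not depend on this file's encoding.
original = u"Saut\u00e9"                   # "Sauté"
utf8_bytes = original.encode("utf-8")      # the accented letter becomes two bytes: C3 A9

# Decoding those bytes with the wrong codec gives the forms seen in the question.
as_latin1 = utf8_bytes.decode("latin-1")   # "SautÃ©"
as_cp437 = utf8_bytes.decode("cp437")      # "Saut├⌐", the pattern in "Sacr├⌐ bleu!"

# Undoing the wrong step recovers the text, provided nothing was lost on the way.
repaired = as_latin1.encode("latin-1").decode("utf-8")

print(as_latin1 == u"Saut\u00c3\u00a9")    # compare code points instead of printing them
print(as_cp437 == u"Saut\u251c\u2310")
print(repaired == original)
```

If the text is correct inside Python but wrong on screen, sys.stdout.encoding reports which encoding Python believes the terminal uses, which may explain a difference between a Windows console and a Mac terminal.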

Answer 2:

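BeautifulSoup tries to guess the encoding and sometimes gets it wrong; however, you can specify the encoding by adding the from_encoding parameter, for example: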

soup = BeautifulSoup(html_text, from_encoding="UTF-8")

The encoding is usually available in the header of the web page.
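As a rough illustration of reading the encoding from the header, the sketch below assumes "header" means a Content-Type value such as "text/html; charset=UTF-8" (the same form appears in an HTML meta tag). The helper charset_from_content_type and the sample inputs are hypothetical and use only the standard library.

```python
# -*- coding: utf-8 -*-
import re


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default` if absent."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", header_value, re.IGNORECASE)
    return match.group(1).lower() if match else default


# Hypothetical inputs: a header value and the raw bytes of a recipe fragment.
content_type = "text/html; charset=UTF-8"
raw_html = u"<p>Saut\u00e9 diced onion</p>".encode("utf-8")

encoding = charset_from_content_type(content_type)   # "utf-8"
html_text = raw_html.decode(encoding)                 # decoded explicitly, no guessing
```

The resulting encoding can be passed as from_encoding when the raw bytes are handed to BeautifulSoup. Note that from_encoding only has an effect on byte input; if the markup is already a Unicode string, BeautifulSoup ignores it.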