BeautifulSoup, where are you putting my HTML?

Posted 2020-02-07 06:34

Question:

I'm using BS4 with Python 2.7. Here's the start of my code (thanks, root):

from bs4 import BeautifulSoup
import urllib2

# Fetch the page and pass the raw HTML to Beautiful Soup (no parser named)
f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
soup = BeautifulSoup(html)

When I print html, its contents match the page source as viewed in Chrome. When I print soup, however, the entire body is cut out and I'm left with this (the contents of the head tag):

<!DOCTYPE html>

<html>
<head>
<title>Browse Movie - YIFY Torrents</title>
<meta charset="utf-8">
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="YIFY-Torrents.com - The official YIFY Torrents website. Here you will be able to browse and download all YIFY rip movies in excellent DVD, 720p, 1080p and 3D quality, all at the smallest file size." name="description"/>
<meta content="torrents, yify, movies, movie, download, 720p, 1080p, 3D, browse movies, yify-torrents" name="keywords"/>
<link href="http://static.yify-torrents.com/yify.ico" rel="shortcut icon"/>
<link href="http://yify-torrents.com/rss" rel="alternate" title="YIFY-Torrents RSS feed" type="application/rss+xml"/>
<link href="http://static.yify-torrents.com/assets/css/styles.css?1353330463" rel="stylesheet" type="text/css"/>
<link href="http://static.yify-torrents.com/assets/css/colorbox.css?1327223987" rel="stylesheet" type="text/css"/>
<script src="http://static.yify-torrents.com/assets/js/jquery-1.6.1.min.js?1327224013" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.validate.min.js?1327224011" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.colorbox-min.js?1327224010" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/form.js?1349683447" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/common.js?1353399801" type="text/javascript"></script>
<script>
        var webRoot = 'http://yify-torrents.com/';
        var IsLoggedIn = 0  </script>
<!--[if !IE]><!--><style type="text/css">#content input.field:focus, #content textarea:focus{border: 1px solid #47bc15 !important;}</style></meta></head></html> 

Where am I going wrong?!

Answer 1:

I had the same problem, and specifying the html5lib parser solved it:

soup = BeautifulSoup(html, 'html5lib')

You need to install html5lib:

pip install html5lib

or

easy_install html5lib

You can read more about the pros and cons of the different parsers for Beautiful Soup here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
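
Since the fix depends on which parsers are installed, here is a minimal sketch of how you might name the parser explicitly and fall back when one is missing. The make_soup helper and the sample markup are hypothetical, made up for illustration; only BeautifulSoup and FeatureNotFound come from bs4. It is written so it should also run on Python 3.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("html5lib", "lxml", "html.parser")):
    """Parse html with the first parser in `parsers` that is installed.

    Hypothetical helper (not part of Beautiful Soup). Returns the soup
    together with the name of the parser that was actually used.
    """
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the named parser is not installed,
            # so move on to the next candidate.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Tiny stand-in document (an assumption; the real page is not reproduced here).
sample = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup, used = make_soup(sample)
# Report which parser was used and whether the body survived parsing.
print(used, soup.body is not None)
```

Printing the parser name alongside a quick soup.body check makes it easy to see which parser produced the tree you are looking at, which is the first thing to check when parts of a page go missing.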