Beautiful Soup findAll doesn't find them all

Posted 2019-09-01 16:08

I am trying to parse a website and extract some information with BeautifulSoup.findAll, but it doesn't find all of the elements. I am using Python 3.

The code looks like this:

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())

manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)

for manga in manga_img:
    print(manga['href'])

It only prints about half of them...

Answer 1:

Different HTML parsers handle broken HTML differently. That page serves broken HTML, and the lxml parser does not cope with it very well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard-library html.parser has less trouble with this particular page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44

Translated to your specific code example using urllib, you would therefore specify the parser:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
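
Putting it together, here is a minimal sketch of the question's script with the parser passed explicitly (the URL and the manga_img class come from the question; whether the page still serves the same markup is an assumption):

#!/usr/bin/python3

from urllib.request import urlopen

from bs4 import BeautifulSoup

# Hand the response object straight to BeautifulSoup and name the more
# lenient standard-library parser explicitly, instead of letting bs4
# pick whichever parser happens to be installed (e.g. lxml).
page = urlopen("http://mangafox.me/directory/")
soup = BeautifulSoup(page, "html.parser")

# find_all is the PEP 8 alias of findAll in bs4; both work.
for manga in soup.find_all("a", class_="manga_img"):
    print(manga["href"])

The only functional change from the original script is the second argument to BeautifulSoup; everything else is cosmetic.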


Source: Beautiful Soup findAll doesn't find them all