beautifulsoup find_all bug?

Posted 2019-03-02 06:56

I am using Beautiful Soup to parse an HTML page, but sometimes find_all returns fewer elements than the page actually contains. For example, the page http://www.totallyfreestuff.com/index.asp?m=0&sb=1&p=5 has 18 headline spans, but the following code finds only two. Can anybody tell me why? Thank you in advance!

from bs4 import BeautifulSoup

# page is the HTML already downloaded from the URL above
soup = BeautifulSoup(page, 'html.parser')
hrefDivList = soup.find_all("span", class_="headline")
# print(hrefDivList)
print(len(hrefDivList))

Tags: beautifulsoup findall

2 Answers

不美不萌又怎样

#2 · 2019-03-02 07:21

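You can try a different parser with Beautiful Soup. Parsers recover from malformed HTML in different ways, so html.parser, lxml, and html5lib can build different trees from the same page.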

import requests
from bs4 import BeautifulSoup

url = "<your url>"
r = requests.get(url)

# 'lxml' requires the lxml package to be installed (pip install lxml)
soup = BeautifulSoup(r.content, 'lxml')
hrefDivList = soup.find_all("span", attrs={"class": "headline"})
print(len(hrefDivList))
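
To check whether the parser is really the cause, it may help to run the same search with each parser and compare the counts. This is only a minimal sketch: count_by_parser and sample_html are hypothetical names made up for illustration, and lxml and html5lib are assumed to be optional, so parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_by_parser(html, tag, css_class,
                    parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: match_count}, skipping parsers that are not installed."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
        counts[parser] = len(soup.find_all(tag, class_=css_class))
    return counts


# Hypothetical sample; replace it with the HTML of the page being scraped.
sample_html = """
<div>
  <span class="headline">First</span>
  <span class="headline">Second</span>
</div>
"""

if __name__ == "__main__":
    for name, count in count_by_parser(sample_html, "span", "headline").items():
        print("{}: {}".format(name, count))
```

If the counts differ on the real page, the markup is probably malformed and the parser choice matters. If every parser gives the same low number, the downloaded HTML itself is worth inspecting.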


Viruses.

#3 · 2019-03-02 07:31

You can try CSS selectors to make your life easier:

hrefDivList = soup.select("span.headline")
# print(hrefDivList)
print(len(hrefDivList))

Or you can iterate directly over the spans and print each one's text:

for every_span in soup.select("span.headline"):
    print(every_span.text)
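
For reference, here is a small self-contained sketch of the selector form next to find_all. The markup in sample_html is a hypothetical stand-in for the real page. Both calls search the same parsed tree, so on this sample they match the same spans; the selector is mainly a more compact way to write the query.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
sample_html = """
<ul>
  <li><span class="headline">Free stickers</span></li>
  <li><span class="headline featured">Free sample</span></li>
  <li><span class="byline">Not a headline</span></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector form and the equivalent find_all call.
by_select = soup.select("span.headline")
by_find_all = soup.find_all("span", class_="headline")

# Compare the text of the matched tags from both calls.
select_texts = [tag.get_text() for tag in by_select]
find_all_texts = [tag.get_text() for tag in by_find_all]

print(select_texts)
print(select_texts == find_all_texts)
```

Note that a span with several classes, such as "headline featured", is matched by both forms.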
