Beautiful Soup find_all bug?

Posted 2019-03-02 07:08

Question:

I am using Beautiful Soup to parse an HTML page, but sometimes find_all returns fewer results than there are matching elements on the page. For example, this page http://www.totallyfreestuff.com/index.asp?m=0&sb=1&p=5 has 18 headline spans, but the following code finds only two. Can anybody tell me why? Thank you in advance!

from bs4 import BeautifulSoup

# `page` holds the HTML source of the page, fetched earlier
soup = BeautifulSoup(page, 'html.parser')
hrefDivList = soup.find_all("span", class_="headline")
# print(hrefDivList)
print(len(hrefDivList))

Answer 1:

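You can try using a different parser with Beautiful Soup. Parsers differ in how they handle invalid markup, so html.parser and lxml can build different trees from the same page.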

import requests
from bs4 import BeautifulSoup

url = "<your url>"
r = requests.get(url)

soup = BeautifulSoup(r.content, 'lxml')
hrefDivList = soup.find_all("span", attrs={"class": "headline"})
print(len(hrefDivList))
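
To check whether the parser really is the cause, it can help to run the same count under every parser you have installed. The following is only a rough sketch: the helper name count_headlines and the sample markup are made up for illustration, and lxml and html5lib are optional packages that are skipped if they are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_headlines(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <span class="headline">} per installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages; skip if absent.
            continue
        counts[name] = len(soup.find_all("span", class_="headline"))
    return counts


# Hypothetical, well-formed sample markup (not taken from the page in the question).
sample = '<div><span class="headline">A</span><span class="headline">B</span></div>'
print(count_headlines(sample))
```

On well-formed markup like the sample, every parser should agree. If the counts differ when you pass in the real page source, the markup is probably malformed and the parser choice matters.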


Answer 2:

You can try CSS selectors to make your life easier:

hrefDivList = soup.select("span.headline")
# print(hrefDivList)
print(len(hrefDivList))

Or you can iterate directly over each span and print its text:

for every_span in soup.select("span.headline"):
    print(every_span.text)
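
One caveat, as far as I can tell: select("span.headline") matches the same elements as find_all with class_ or attrs, so it is more convenient but will not find spans that are missing from the parsed tree. A small sketch on made-up sample markup (not the page from the question) shows the three forms agreeing:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<span class="headline">First</span>
<span class="headline featured">Second</span>
<span class="other">Third</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Three ways of asking for <span> elements that carry the "headline" class.
by_keyword = soup.find_all("span", class_="headline")
by_attrs = soup.find_all("span", attrs={"class": "headline"})
by_css = soup.select("span.headline")

texts = [span.get_text() for span in by_css]
print(len(by_keyword), len(by_attrs), len(by_css))
print(texts)
```

If all three counts are the same but still lower than expected on the real page, the parser is the more likely culprit than the query.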