Scrape Google resultStats with Python [closed]

Posted 2019-02-23 11:57

I would like to get the estimated number of results from Google for a keyword. I'm using Python 3.3 and trying to accomplish this with BeautifulSoup and urllib.request. This is my simple code so far:

from urllib.request import Request, urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup

def numResults():
    scounttext = None  # stays None if the request fails
    try:
        page_google = '''http://www.google.de/#output=search&sclient=psy-ab&q=pokerbonus&oq=pokerbonus&gs_l=hp.3..0i10l2j0i10i30l2.16503.18949.0.20819.10.9.0.1.1.0.413.2110.2-6j1j1.8.0....0...1c.1.19.psy-ab.FEBvxrgi0KU&pbx=1&bav=on.2,or.r_qf.&bvm=bv.48705608,d.Yms&'''
        req_google = Request(page_google)
        req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
        html_google = urlopen(req_google).read()
        soup = BeautifulSoup(html_google)
        scounttext = soup.find('div', id='resultStats')
    except URLError as e:
        print(e)
    return scounttext

My problem is that my soup variable is somehow encoded, so I can't get any information out of it. I get back None because soup.find doesn't find anything.

What am I doing wrong, and how can I extract the wanted resultStats? Many thanks!

1 answer
Rolldiameter
#2 · 2019-02-23 12:09

If you haven't solved this problem yet: BeautifulSoup fails to find anything because the resultStats div never appears in the soup. Everything after the # in your URL is a fragment, which is not sent to the server, so your Request(page_google) only returns the start page with its JavaScript, not the search results that the JavaScript loads in dynamically. You can verify this by adding a

print(soup)

command to your code and you will see that the resultStats div doesn't appear.
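To see why the request comes back without results, it may help to look at how the URL from the question is split. This is only a small sketch using the standard library; the URL is a shortened version of the one in the question. The search terms sit entirely in the fragment (the part after #), which a client keeps to itself, so the server is only asked for "/".

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Shortened version of the URL from the question; everything after '#'
# is the fragment and never leaves the client.
url = "http://www.google.de/#output=search&sclient=psy-ab&q=pokerbonus"

parts = urlsplit(url)
print(parts.path)      # '/'
print(parts.query)     # '' (no query string at all)
print(parts.fragment)  # 'output=search&sclient=psy-ab&q=pokerbonus'

# urllib's Request drops the fragment as well: this is the path it asks for.
req = Request(url)
print(req.selector)    # '/'
```

Putting the keyword into a real query string (/search?q=...) is what makes the server return a results page.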

The following code (written for Python 2, where urllib2 and urllib.quote_plus exist):

from urllib2 import Request, urlopen
import urllib
from bs4 import BeautifulSoup
query = 'pokerbonus'
url = "http://www.google.de/search?q=%s" % urllib.quote_plus(query)
req_google = Request(url)
req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
html_google = urlopen(req_google).read()
soup = BeautifulSoup(html_google)
scounttext = soup.find('div', id='resultStats')
print(scounttext)

Will print

<div class="sd" id="resultStats">Ungefähr 1.060.000 Ergebnisse</div>
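Since the question targets Python 3.3, where urllib2 no longer exists, a rough Python 3 port might look like the sketch below. The names build_request, parse_count and fetch_result_stats are made up for this example and are not part of the code above. parse_count assumes the count is the first number in the text and that "." or "," only separate digit groups, as in the German output shown. Whether Google still serves a resultStats div to a request like this is not guaranteed.

```python
import re
from urllib.parse import quote_plus
from urllib.request import Request, urlopen

USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) "
              "Gecko/2008092417 Firefox/3.0.3")


def build_request(query):
    """Build the search request with the keyword in a real query string."""
    url = "http://www.google.de/search?q=%s" % quote_plus(query)
    req = Request(url)
    req.add_header("User-Agent", USER_AGENT)
    return req


def parse_count(stats_text):
    """Turn text like 'Ungefähr 1.060.000 Ergebnisse' into an int.

    Assumption: the first number in the text is the result count and
    '.' or ',' are only digit-group separators. Returns None if no
    number is found.
    """
    match = re.search(r"\d[\d.,]*", stats_text)
    if match is None:
        return None
    return int(re.sub(r"\D", "", match.group()))


def fetch_result_stats(query):
    """Fetch the results page and return the count, or None if the div is missing."""
    # Imported here so the helpers above also work without bs4 installed.
    from bs4 import BeautifulSoup

    html = urlopen(build_request(query)).read()
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="resultStats")
    return None if div is None else parse_count(div.get_text())
```

Calling fetch_result_stats('pokerbonus') performs the network request; parse_count can be tried on its own with the text from the output above.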

Lastly, using a tool like Selenium Webdriver might be a better way to go about solving this, as Google does not allow bots to scrape search results.
