I am querying Google Search and it works fine locally, returning the expected results. When the same code is deployed on App Engine, the request gets a 302 response (logged as "302 None") and no results come back.
The following program returns the links found in the Google Search results.
# The first two imports will be slightly different when deployed on appengine
from pyquery import PyQuery as pq
import requests
import random
try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote
USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100 101 Firefox/22.0',
'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)
SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'
def get_result(url):
    return requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text
def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    return [a.attrib['href'] for a in html('.l')] or \
        [a.attrib['href'] for a in html('.r')('a')]
print get_links('foo bar')
Code deployed on AppEngine:
import sys
sys.path[0:0] = ['distlibs']
import lxml
import webapp2
import json
from requests import api
from pyquery.pyquery import PyQuery as pq
import random
try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote
USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100 101 Firefox/22.0',
'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)
SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'
def get_result(url):
    return api.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text
def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    return [a.attrib['href'] for a in html('.l')] or \
        [a.attrib['href'] for a in html('.r')('a')]
form="""
<form action="/process">
<input name="q">
<input type="submit">
</form>
"""
class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.out.write("<h3>Write something.</h3><br>")
        self.response.out.write(form)
class ProcessHandler(webapp2.RequestHandler):
    def get(self):
        query = self.request.get("q")
        self.response.out.write("Your query : " + query)
        results = get_links(query)
        self.response.out.write(results[0])
app = webapp2.WSGIApplication([('/', MainHandler),
                               ('/process', ProcessHandler)],
                              debug=True)
I have tried querying with both the http and https protocols. The following is the AppEngine log for a request.
Starting new HTTP connection (1): www.google.com
D 2013-12-21 13:13:37.217
"GET /search?q=site:foobar.com%20foo%20bar HTTP/1.1" 302 None
I 2013-12-21 13:13:37.218
Starting new HTTP connection (1): ipv4.google.com
D 2013-12-21 13:13:37.508
"GET /sorry/IndexRedirect?continue=http://www.google.com/search%3Fq%3Dsite:foobar.com%20foo%20bar HTTP/1.1" 403 None
E 2013-12-21 20:51:32.090
list index out of range
I'm puzzled as to why you're trying to spoof the User-Agent header, but if it makes you happy, go for it. Just note that if requests.get is using urlfetch under the covers, App Engine appends a string to the User-Agent header your app supplies, identifying your app. (See https://developers.google.com/appengine/docs/python/urlfetch/#Python_Request_headers.)
Try passing follow_redirects=False to urlfetch. That's how you make requests to other App Engine apps. For completely non-obvious reasons, it might help you in this case.
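A minimal sketch of that suggestion, assuming you call urlfetch directly instead of going through requests. The helper name fetch_without_redirects and its injectable fetch parameter are hypothetical, added only so the function can be tried outside App Engine; urlfetch.fetch and its follow_redirects argument are the real API.

```python
def fetch_without_redirects(url, user_agent, fetch=None):
    """Fetch url without following redirects; return (status_code, body).

    fetch is a hypothetical hook for testing; by default the App Engine
    urlfetch.fetch function is used.
    """
    if fetch is None:
        # Imported lazily because this module only exists inside the
        # App Engine runtime (or its SDK).
        from google.appengine.api import urlfetch
        fetch = urlfetch.fetch
    response = fetch(url,
                     headers={'User-Agent': user_agent},
                     follow_redirects=False)
    return response.status_code, response.content
```

If the returned status is still 302, the request is being redirected as in the log above, and the body should not be parsed for links. Whether this changes the response Google sends is not guaranteed.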