Google search returns None 302 on AppEngine

Posted 2019-09-02 03:26

Question:

I am querying Google Search, and locally the code works fine and returns the expected results. When the same code is deployed on AppEngine, the request gets a 302 response (logged as "302 None").

The following program returns the links found in the Google Search results.

# The first two imports will be slightly different when deployed on appengine
from pyquery import PyQuery as pq
import requests
import random
try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote

USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
               'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
               'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)


SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'

def get_result(url):
    return requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text


def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    return [a.attrib['href'] for a in html('.l')] or \
        [a.attrib['href'] for a in html('.r')('a')]

print(get_links('foo bar'))

Code deployed on AppEngine:

import sys
sys.path[0:0] = ['distlibs']

import lxml
import webapp2
import json
from requests import api
from pyquery.pyquery import PyQuery as pq
import random

try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote


USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
               'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
               'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)


SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'



def get_result(url):
    return api.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text


def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    return [a.attrib['href'] for a in html('.l')] or \
        [a.attrib['href'] for a in html('.r')('a')]


form="""
<form action="/process">
    <input name="q">
    <input type="submit">
</form>
"""


class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.out.write("<h3>Write something.</h3><br>")
        self.response.out.write(form)


class ProcessHandler(webapp2.RequestHandler):
    def get(self):
        query = self.request.get("q")
        self.response.out.write("Your query : " + query)
        results = get_links(query)
        self.response.out.write(results[0])



app = webapp2.WSGIApplication([('/', MainHandler),
                               ('/process', ProcessHandler)],
                               debug=True)

I have tried querying with both the http and https protocols. The following is the AppEngine log for a request.

Starting new HTTP connection (1): www.google.com
D 2013-12-21 13:13:37.217
"GET /search?q=site:foobar.com%20foo%20bar HTTP/1.1" 302 None
I 2013-12-21 13:13:37.218
Starting new HTTP connection (1): ipv4.google.com
D 2013-12-21 13:13:37.508
"GET /sorry/IndexRedirect?continue=http://www.google.com/search%3Fq%3Dsite:foobar.com%20foo%20bar HTTP/1.1" 403 None
E 2013-12-21 20:51:32.090
list index out of range

Answer 1:

I'm puzzled as to why you're trying to spoof the User-Agent header, but if it makes you happy, go for it. Just note that if requests.get is using urlfetch under the covers, App Engine appends a string to the User-Agent header your app supplies, identifying your app (see https://developers.google.com/appengine/docs/python/urlfetch/#Python_Request_headers).
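To make the point about the User-Agent concrete, here is a rough illustration of what the remote server would see. The appended format is taken from the urlfetch documentation as I remember it; the helper name effective_user_agent, the single-space separator, and the app ID 'my-app' are assumptions for illustration only, not something App Engine exposes.

```python
# Format of the identifier App Engine appends, per the urlfetch docs
# (treat the exact text as an assumption; check the linked page).
APPENDED_FORMAT = 'AppEngine-Google; (+http://code.google.com/appengine; appid: {0})'


def effective_user_agent(supplied, app_id):
    """Approximate the User-Agent a remote server receives from urlfetch.

    Hypothetical helper: the separator between the supplied value and the
    appended identifier is assumed to be a single space.
    """
    return '{0} {1}'.format(supplied, APPENDED_FORMAT.format(app_id))


# 'my-app' is a placeholder app ID.
ua = effective_user_agent('Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0', 'my-app')
```

So whichever browser string gets picked from USER_AGENTS, the request would still be identifiable as coming from an App Engine app.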

Try passing follow_redirects=False to urlfetch. That's how you make requests to other App Engine apps. For completely non-obvious reasons, it might help you in this case.
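A minimal sketch of that suggestion, calling urlfetch directly instead of going through requests. The function name fetch_without_redirects and the injectable fetch parameter are my own additions (the latter only so the function can be exercised outside App Engine); urlfetch.fetch and its follow_redirects argument are the real API. I have not verified that this gets past the 302 in your logs.

```python
def fetch_without_redirects(url, user_agent, fetch=None):
    """Fetch `url` once without following redirects.

    Returns the response body on HTTP 200, otherwise None.
    `fetch` is a hypothetical injection point for testing; when omitted,
    App Engine's urlfetch.fetch is used.
    """
    if fetch is None:
        # Only importable inside the App Engine Python runtime.
        from google.appengine.api import urlfetch
        fetch = urlfetch.fetch
    response = fetch(url,
                     headers={'User-Agent': user_agent},
                     follow_redirects=False)
    if response.status_code != 200:
        # e.g. the 302 to /sorry/IndexRedirect seen in the log:
        # there are no search results to parse in that case.
        return None
    return response.content
```

In get_result you would call this instead of api.get, and in ProcessHandler check for None (and for an empty list from get_links) before indexing results[0], which is what produces the "list index out of range" error in the log.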