
Fetching a lot of urls in python with google app engine

Published 2019-09-17 10:06

Question:

In my subclass of RequestHandler, I am trying to fetch a range of urls:

import urllib2
import webapp2

class GetStats(webapp2.RequestHandler):
    def post(self):
        heap = []
        lastpage = 50
        for page in range(1, lastpage):
            tmpurl = url + str(page)  # url is the base url, defined elsewhere
            response = urllib2.urlopen(tmpurl, timeout=5)
            html = response.read()
            # some parsing of html
            heap.append(result_of_parsing)

        self.response.write(heap)

This works for roughly 30 urls (the page takes a long time to load, but it does work). With more than 30 urls I get an error:

Error: Server Error

The server encountered an error and could not complete your request.

Please try again in 30 seconds.

Is there any way to fetch a lot of urls, perhaps more efficiently? Up to several hundred pages?

Update:

I am using BeautifulSoup to parse every single page. I found this traceback in the GAE logs:

Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
result = handler(dict(self._environ), self._StartResponse)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
  File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 68, in post
heap = get_times(tmp_url, 160)
  File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 106, in get_times
soup = BeautifulSoup(html)
  File "libs/bs4/__init__.py", line 168, in __init__
self._feed()
  File "libs/bs4/__init__.py", line 181, in _feed
self.builder.feed(self.markup)
  File "libs/bs4/builder/_htmlparser.py", line 56, in feed
super(HTMLParserTreeBuilder, self).feed(markup)
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 155, in goahead
startswith = rawdata.startswith
 DeadlineExceededError

Answer 1:

It's failing because you only have 60 seconds to return a response to the user, and I'm going to guess it's taking longer than that.

You will want to use this: https://cloud.google.com/appengine/articles/deferred

to create a task that has a 10-minute timeout. Then you can return instantly to the user, and they can "pick up" the results at a later time via another handler (that you create). If collecting all the URLs takes longer than 10 minutes, you'll have to split them up into further tasks.
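
For illustration, a minimal sketch of the deferred approach, assuming a hypothetical fetch_stats worker and a hypothetical save_results/parse pair for storing and parsing the pages (none of these are from the original question):

    import urllib2
    import webapp2
    from google.appengine.ext import deferred

    def fetch_stats(base_url, lastpage):
        # Hypothetical worker: runs on a task queue, so it gets a
        # 10-minute deadline instead of the 60-second request deadline.
        heap = []
        for page in range(1, lastpage):
            response = urllib2.urlopen(base_url + str(page), timeout=5)
            heap.append(parse(response.read()))  # parse() is a placeholder
        # Store the results (datastore, memcache, ...) so another handler
        # can return them to the user later.
        save_results(heap)  # hypothetical helper

    class GetStats(webapp2.RequestHandler):
        def post(self):
            # Enqueue the long-running work and return immediately.
            deferred.defer(fetch_stats, url, 50)
            self.response.write('Fetching started; check back for results.')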

See this: https://cloud.google.com/appengine/articles/deadlineexceedederrors

to understand why you cannot go longer than 60 seconds.



Answer 2:

Edit: This might come from App Engine quotas and limits. Sorry for the previous answer:

This looks like server-side protection against DDoS or scraping from a single client. You have a few options:

  • Waiting after a certain number of requests before continuing (see the sketch after this list).

  • Making requests from several clients with different IP addresses and sending the information back to your main script (renting separate servers for this might be costly).

  • Checking whether the website has an API for accessing the data you need.
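
As a rough illustration of the first option, a sketch of pausing between batches of requests (BATCH_SIZE and DELAY_SECONDS are made-up values, not from the original answer; tune them for the site you are fetching from):

    import time
    import urllib2

    BATCH_SIZE = 10      # hypothetical: requests per batch
    DELAY_SECONDS = 2    # hypothetical: pause between batches

    def fetch_pages(base_url, lastpage):
        pages = []
        for page in range(1, lastpage):
            if page % BATCH_SIZE == 0:
                time.sleep(DELAY_SECONDS)  # back off so the server does not block you
            response = urllib2.urlopen(base_url + str(page), timeout=5)
            pages.append(response.read())
        return pages

Note that sleeping inside a normal App Engine request handler only makes the 60-second deadline problem worse, so throttling like this belongs in a deferred task rather than in the handler itself.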

You should also take care, as the site owner could block or blacklist your IP if they decide your requests are abusive.