
Async query database for keys to use in multiple requests

Posted 2020-08-08 04:47

Question:

I want to asynchronously query a database for keys, then make requests to several urls for each key.

I have a function that returns a Deferred from the database whose value is the key for several requests. Ideally, I would call this function and return a generator of Deferreds from start_requests.

@inlineCallbacks
def get_request_deferred(self):
    d = yield engine.execute(select([table]))  # async
    d.addCallback(make_url)
    d.addCallback(Request)
    return d

def start_requests(self):
    ????

But attempting this in several ways raises

builtins.AttributeError: 'Deferred' object has no attribute 'dont_filter'

which I take to mean that start_requests must return Request objects, not Deferreds whose values are Request objects. The same seems to be true of spider middleware's process_start_requests().

Alternatively, I can make initial requests to, say, http://localhost/ and change them to the real url once the key is available from the database, via the downloader middleware's process_request(). However, process_request can only return a single Request object; it cannot yield Requests to multiple pages using the key: attempting yield Request(url) raises

AssertionError: Middleware myDownloaderMiddleware.process_request
must return None, Response or Request, got generator
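In other words, the most a downloader middleware can do is swap the url on the single request it was given, roughly like this (simplified sketch; get_key_somehow stands in for however the key arrives from the database):

class MyDownloaderMiddleware:
    def process_request(self, request, spider):
        key = get_key_somehow()                     # placeholder: the key from the database
        return request.replace(url=make_url(key))   # one Request in, at most one Request out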

What is the cleanest solution to

  • get the keys asynchronously from the database
  • for each key, generate several requests

Answer 1:

You've provided no use case that makes async database queries a necessity. I'm assuming you cannot begin to scrape your URLs until you have queried the database? If that's the case, you're better off just doing the query synchronously, iterating over the query results, extracting what you need, and then yielding Request objects, as sketched below. It makes little sense to query the database asynchronously and then just sit around waiting for the query to finish.
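A rough sketch of that approach (assuming the same engine, select and table as in your snippet, plus a hypothetical make_urls helper that turns one key into several urls):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # plain, blocking query -- it runs only once, before crawling starts
        result = engine.execute(select([table]))      # engine/select/table as in the question
        for row in result:
            key = row[0]                              # assuming the key is the first column
            for url in make_urls(key):                # hypothetical helper: one key -> several urls
                yield scrapy.Request(url, callback=self.parse)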



Answer 2:

You can have the Deferred's callback pass the urls to a generator. The generator then converts each received url into a scrapy Request object and yields it. Below is an example based on the code you linked (not tested):

import scrapy
from queue import Queue
from pdb import set_trace as st
from twisted.internet.defer import Deferred


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls = Queue()                      # urls handed over by the Deferred callback
        self.stop = False
        self.requests = self.request_generator()
        self.deferred = self.deferred_generator()

    def deferred_generator(self):
        d = Deferred()
        d.addCallback(self.deferred_callback)    # fires when the database hands back a url
        yield d

    def request_generator(self):
        while not self.stop:
            url = self.urls.get()                # blocks until a url has been queued
            yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        yield next(self.requests)

    def parse(self, response):
        st()                                     # drop into the debugger to inspect the response

        # when you need to crawl the next url from the callback
        yield next(self.requests)

    def deferred_callback(self, url):
        self.urls.put(url)
        if no_more_urls():                       # placeholder: your own "am I done?" check
            self.stop = True
Don't forget to stop the request generator when you're done.
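To wire it up, whatever code receives the keys from the database would fire the Deferred once a url is known, roughly like this (sketch; url_from_db is a placeholder and you need a reference to the spider instance):

d = next(spider.deferred)       # pull the Deferred created in deferred_generator
d.callback(url_from_db)         # runs deferred_callback, which queues the url for request_generator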