I want to asynchronously query a database for keys, then make requests to several URLs for each key.
I have a function that returns a Deferred from the database whose value is the key for several requests. Ideally, I would call this function and return a generator of Deferreds from start_requests.
@inlineCallbacks
def get_request_deferred(self):
    d = yield engine.execute(select([table]))  # async
    d.addCallback(make_url)
    d.addCallback(Request)
    return d

def start_requests(self):
    ????
But attempting this in several ways raises

builtins.AttributeError: 'Deferred' object has no attribute 'dont_filter'

which I take to mean that start_requests must return Request objects, not Deferreds whose values are Request objects. The same seems to be true of spider middleware's process_start_requests().
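For example, one of those attempts looked roughly like this (relying on get_request_deferred above):

def start_requests(self):
    # yield the Deferred itself and hope Scrapy waits on it
    yield self.get_request_deferred()
    # -> builtins.AttributeError: 'Deferred' object has no attribute 'dont_filter'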
Alternatively, I can make initial requests to, say, http://localhost/ and change them to the real URL once the key is available from the database, through downloader middleware's process_request(). However, process_request may only return a single Request object (or None, or a Response); it cannot yield Requests to multiple pages using the key. Attempting yield Request(url) raises

AssertionError: Middleware myDownloaderMiddleware.process_request must return None, Response or Request, got generator
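That failing attempt looks roughly like this (make_urls and spider.key are placeholders for however the key ends up on the spider):

import scrapy

class myDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # trying to fan the placeholder request out into several real ones;
        # the yield turns this method into a generator, which Scrapy rejects
        for url in make_urls(spider.key):
            yield scrapy.Request(url)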
What is the cleanest solution to
- get keys asynchronously from the database
- for each key, generate several requests
You've provided no use case where async database queries are a necessity. I'm assuming you cannot begin to scrape your URLs unless you query the database first? If that's the case then you're better off just doing the query synchronously, iterating over the query results, extracting what you need, then yielding Request objects. It makes little sense to query a db asynchronously and then just sit around waiting for the query to finish.
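If you go that route, a minimal sketch might look like the following. It reuses table from your snippet, assumes a SQLAlchemy engine, and uses make_urls as a stand-in for however you build several urls from one key; the connection string is a placeholder:

import scrapy
from sqlalchemy import create_engine, select

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # block once at startup, then yield plain Request objects
        engine = create_engine('postgresql://user:pass@localhost/mydb')  # placeholder
        with engine.connect() as conn:
            keys = [row[0] for row in conn.execute(select([table]))]
        for key in keys:
            for url in make_urls(key):  # make_urls: your own helper, several urls per key
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # handle each page here
        ...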
You can let the callback for the Deferred object pass the urls to a generator of some sort. The generator will then convert any received urls into scrapy Request objects and yield them. Below is an example using the code you linked (not tested):
import scrapy
from queue import Queue  # Queue in Python 2, queue in Python 3
from pdb import set_trace as st
from twisted.internet.defer import Deferred


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls = Queue()
        self.stop = False
        self.requests = self.request_generator()
        self.deferred = self.deferred_generator()

    def deferred_generator(self):
        d = Deferred()
        d.addCallback(self.deferred_callback)
        yield d

    def request_generator(self):
        # blocks on the queue until deferred_callback puts a url into it
        while not self.stop:
            url = self.urls.get()
            yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        yield next(self.requests)

    def parse(self, response):
        st()  # debugging breakpoint

        # when you need to parse the next url from the callback
        yield next(self.requests)

    def deferred_callback(self, url):
        self.urls.put(url)
        if no_more_urls():  # no_more_urls() is a placeholder for your own check
            self.stop = True
Don't forget to stop the request generator when you're done.
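The one piece the example leaves open is how urls reach deferred_callback in the first place; that depends on your database layer. Purely as a sketch, assuming engine.execute() returns a Deferred of rows as in your question and make_url builds a url from a row, you could call deferred_callback once per row:

def feed_spider(spider):
    # fire the async query and push every resulting url into the spider's queue
    d = engine.execute(select([table]))
    d.addCallback(lambda rows: [spider.deferred_callback(make_url(row)) for row in rows])
    return d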