There are a few concurrency settings in Scrapy, such as CONCURRENT_REQUESTS. Does that mean the Scrapy crawler is multi-threaded? If I run scrapy crawl my_crawler,
will it literally fire multiple simultaneous requests in parallel?
I'm asking because I've read that Scrapy is single-threaded.
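For reference, here is how the concurrency settings mentioned above might appear in a project's settings.py. The values are illustrative defaults, not recommendations:

```python
# settings.py -- illustrative values for Scrapy's concurrency knobs
CONCURRENT_REQUESTS = 16            # max requests in flight globally (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # max requests in flight per domain (default is 8)
DOWNLOAD_DELAY = 0.25               # seconds to wait between requests to the same site (default 0)
```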
Answer 1:
Scrapy is single-threaded, except the interactive shell and some tests, see source.
It's built on top of Twisted, which is single-threaded too and makes use of its own asynchronous concurrency capabilities, such as twisted.internet.interfaces.IReactorThreads.callFromThread, see source.
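The key idea, that many requests can be "in flight" at once without any extra threads, can be sketched with the stdlib asyncio module standing in for Twisted's reactor (this is not Scrapy's actual code, just an illustration of single-threaded concurrency):

```python
# Sketch: concurrent "requests" in a single thread, using asyncio
# in place of Twisted's reactor to illustrate the idea.
import asyncio
import threading

thread_ids = []

async def fake_request(n):
    # Record which OS thread runs this "request".
    thread_ids.append(threading.get_ident())
    await asyncio.sleep(0.01)  # simulate network I/O, yielding to the event loop
    return n

async def main():
    # Three requests overlap in time, yet no threads are spawned.
    return await asyncio.gather(*(fake_request(i) for i in range(3)))

results = asyncio.run(main())
assert len(set(thread_ids)) == 1  # every request ran on the same thread
```

While one coroutine waits on I/O, the event loop switches to another, which is how a single thread keeps many requests active at once.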
Answer 2:
Scrapy does most of its work synchronously. However, the handling of requests is done asynchronously.
I suggest this page if you haven't already seen it.
http://doc.scrapy.org/en/latest/topics/architecture.html
Edit: I realize now that the question was about threading, not necessarily about whether it's asynchronous. That link is still a good read, though :)
Regarding your question about CONCURRENT_REQUESTS: this setting changes the number of requests that Twisted will defer at once. Once that many requests have been started, it waits for some of them to finish before starting more.
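That "start at most N, then wait for slots to free up" behavior can be sketched with a semaphore. This is a simplified stand-in for what Scrapy's downloader does with Twisted deferreds, not its real implementation:

```python
# Sketch: cap the number of in-flight "requests" at CONCURRENT_REQUESTS,
# analogous to how Scrapy's downloader throttles outstanding requests.
import asyncio

CONCURRENT_REQUESTS = 2
in_flight = 0
peak = 0  # highest number of requests active at the same time

async def fetch(sem, n):
    global in_flight, peak
    async with sem:                # block here until a slot is free
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulate the download
        in_flight -= 1
    return n

async def main():
    sem = asyncio.Semaphore(CONCURRENT_REQUESTS)
    return await asyncio.gather(*(fetch(sem, i) for i in range(6)))

asyncio.run(main())
assert peak <= CONCURRENT_REQUESTS  # never more than N requests at once
```

Six requests are scheduled, but only two are ever active at a time; the rest wait for a slot, just as Scrapy holds back requests beyond the configured limit.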
Answer 3:
Scrapy is a single-threaded framework; we cannot use multiple threads within a spider at the same time. However, we can run multiple spiders and pipelines at the same time to make the process concurrent.
Scrapy does not support multi-threading because it is built on Twisted, which is an asynchronous networking framework.