Scrapy doesn't seem to be doing DFO

Posted 2019-04-06 04:29

I have a website that my crawler needs to traverse in a particular sequence. For example, it needs to visit a1, b1, c1 before it starts on a2, and so on. Each of a, b and c is handled by a different parse function, and the corresponding URLs are wrapped in a Request object and yielded. The following roughly illustrates the code I'm using:

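# Import paths as in the Scrapy 0.14-era API
from scrapy.http import Request
from scrapy.spider import BaseSpider

class aspider(BaseSpider):

    def parse(self, response):
        # b is a placeholder for the level-b URL built from this response
        yield Request(b, callback=self.parse_b, priority=10)

    def parse_b(self, response):
        # c is a placeholder for the level-c URL built from this response
        yield Request(c, callback=self.parse_c, priority=20)

    def parse_c(self, response):
        final_function()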

However, I find that the crawl sequence seems to be a1, a2, a3, b1, b2, b3, c1, c2, c3, which is strange since I thought Scrapy was supposed to guarantee depth-first order.

The sequence doesn't have to be strict, but the site I'm scraping has a limit in place, so Scrapy needs to start scraping level c as soon as it can, before 5 level-b pages get crawled. How can this be achieved?

3 answers
淡お忘
#2 · 2019-04-06 05:09

Scrapy uses DFO by default. The crawl sequence looks out of order because Scrapy fetches pages asynchronously: even with DFO, several requests are in flight at once, and network delay or other factors decide which responses come back first.
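If overlapping in-flight requests are what scrambles the order, one thing to try is limiting concurrency in the project's settings.py. This is only a sketch: CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN are standard Scrapy settings, but whether these values are enough for a particular site's limit is an assumption, and the crawl will be slower.

```python
# settings.py (sketch): keep only one request in flight at a time,
# so responses cannot overtake each other on the network.
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```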

冷血范
#3 · 2019-04-06 05:12

Depth first searching is exactly what you are describing:

search as deep into a's as possible before moving to b's

To change Scrapy to do breadth-first searching (a1, a2, a3, b1, etc.), change these settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

*Found in the doc.scrapy.org FAQ

小情绪 Triste *
#4 · 2019-04-06 05:14

I believe that you are noticing the difference between depth-first and breadth-first search algorithms (see Wikipedia for info on both).

Scrapy has the ability to change which algorithm is used:

"By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:"

See http://doc.scrapy.org/en/0.14/faq.html for more information.
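To see why the queue type decides the order, here is a small stand-alone simulation in plain Python (it does not use Scrapy itself). The three-level a/b/c site, the NEXT_LEVEL table and the crawl_order function are made up for illustration, and each page is assumed to link to exactly one page on the next level, fetched one at a time.

```python
from collections import deque

# Hypothetical site layout: every a-page links to one b-page,
# and every b-page links to one c-page.
NEXT_LEVEL = {"a": "b", "b": "c"}


def crawl_order(start_pages, lifo):
    """Return the visit order for a LIFO (stack) or FIFO pending queue."""
    pending = deque(start_pages)
    visited = []
    while pending:
        # LIFO takes the newest pending page, FIFO takes the oldest.
        page = pending.pop() if lifo else pending.popleft()
        visited.append(page)
        level, index = page[0], page[1:]
        if level in NEXT_LEVEL:
            # "Yield" the request for the page one level deeper.
            pending.append(NEXT_LEVEL[level] + index)
    return visited


dfo = crawl_order(["a1", "a2", "a3"], lifo=True)
bfo = crawl_order(["a1", "a2", "a3"], lifo=False)
print(dfo)  # ['a3', 'b3', 'c3', 'a2', 'b2', 'c2', 'a1', 'b1', 'c1']
print(bfo)  # ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3']
```

With the LIFO queue each chain is followed down to level c before the next a-page is touched (starting from the last start page), while the FIFO queue reproduces the a1, a2, a3, b1, ... order from the question.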
