Why does my Scrapy crawler stop?

Posted 2019-05-26 15:47

I have written a crawler using the Scrapy framework to parse a products site. The crawler stops abruptly partway through, without completing the full parsing process. I have researched this a lot, and most of the answers indicate that my crawler is being blocked by the website. Is there any mechanism by which I can detect whether my spider is being stopped by the website, or whether it stops on its own?

Below is the INFO-level log output from the spider:

2013-09-23 09:59:07+0000 [scrapy] INFO: Scrapy 0.18.0 started (bot: crawler)  
2013-09-23 09:59:08+0000 [spider] INFO: Spider opened  
2013-09-23 09:59:08+0000 [spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)  
2013-09-23 10:00:08+0000 [spider] INFO: Crawled 10 pages (at 10 pages/min), scraped 7 items (at 7 items/min)  
2013-09-23 10:01:08+0000 [spider] INFO: Crawled 22 pages (at 12 pages/min), scraped 19 items (at 12 items/min)  
2013-09-23 10:02:08+0000 [spider] INFO: Crawled 31 pages (at 9 pages/min), scraped 28 items (at 9 items/min)  
2013-09-23 10:03:08+0000 [spider] INFO: Crawled 40 pages (at 9 pages/min), scraped 37 items (at 9 items/min)  
2013-09-23 10:04:08+0000 [spider] INFO: Crawled 49 pages (at 9 pages/min), scraped 46 items (at 9 items/min)  
2013-09-23 10:05:08+0000 [spider] INFO: Crawled 59 pages (at 10 pages/min), scraped 56 items (at 10 items/min)  

Below is the last part of the DEBUG-level log before the spider is closed:

2013-09-25 11:33:24+0000 [spider] DEBUG: Crawled (200) <GET http://url.html> (referer: http://site_name)
2013-09-25 11:33:24+0000 [spider] DEBUG: Scraped from <200 http://url.html>

// scraped data in JSON form

2013-09-25 11:33:25+0000 [spider] INFO: Closing spider (finished)  
2013-09-25 11:33:25+0000 [spider] INFO: Dumping Scrapy stats:  
    {'downloader/request_bytes': 36754,  
     'downloader/request_count': 103,  
     'downloader/request_method_count/GET': 103,  
     'downloader/response_bytes': 390792,  
     'downloader/response_count': 103,  
     'downloader/response_status_count/200': 102,  
     'downloader/response_status_count/302': 1,  
     'finish_reason': 'finished',  
     'finish_time': datetime.datetime(2013, 9, 25, 11, 33, 25, 1359),  
     'item_scraped_count': 99,  
     'log_count/DEBUG': 310,  
     'log_count/INFO': 14,  
     'request_depth_max': 1,  
     'response_received_count': 102,  
     'scheduler/dequeued': 100,  
     'scheduler/dequeued/disk': 100,  
     'scheduler/enqueued': 100,  
     'scheduler/enqueued/disk': 100,  
     'start_time': datetime.datetime(2013, 9, 25, 11, 23, 3, 869392)}  
2013-09-25 11:33:25+0000 [spider] INFO: Spider closed (finished)  

There are still pages remaining to be parsed, but the spider stops.
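
For reference, one way I could try to surface blocking (rather than guessing from silence) would be to let non-200 responses reach the callback and log them. This is only a rough sketch against the 0.18-style API, not my real spider, and the URL and names are placeholders:

    # Sketch: make possible blocks (403/429/503, redirects to a ban page)
    # visible in the log instead of being filtered out by HttpErrorMiddleware.
    from scrapy import log
    from scrapy.spider import BaseSpider

    class BlockCheckSpider(BaseSpider):
        name = "block_check"
        start_urls = ["http://site_name/"]                 # placeholder

        # Let these statuses reach parse() instead of being dropped.
        handle_httpstatus_list = [301, 302, 403, 429, 503]

        def parse(self, response):
            if response.status != 200:
                self.log("Possible block: %s returned %d"
                         % (response.url, response.status), level=log.WARNING)
                return
            # ... normal parsing would yield items and follow-up requests here ...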

Tags: scrapy
1 Answer
混吃等死
#2 · 2019-05-26 16:36

As far as I know, for a spider:

  1. There is a queue or pool of URLs to be scraped/parsed by parsing methods. You can bind a URL to a specific callback method, or let the default parse method do the job.
  2. From the parsing methods you must return/yield further request(s) to feed that pool, or item(s).
  3. When the pool runs out of URLs, or a stop signal is sent, the spider stops crawling.

It would be nice if you shared your spider code so we could check whether those bindings are correct; it's easy to miss some bindings by mistake when using SgmlLinkExtractor, for example. Note that your stats show finish_reason: 'finished' with scheduler/enqueued equal to scheduler/dequeued (100), which means the scheduler simply ran out of requests rather than the site forcibly shutting the spider down.
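
As an illustration only (a rough sketch with Scrapy 0.18-era imports; the item fields, URL patterns and names are placeholders, not your actual site), the bindings I mean look roughly like this, where the rules keep feeding the URL pool and the callback yields items:

    # Sketch of a rule-based spider: if the allow= patterns or follow= flags
    # are too narrow, the URL pool empties early and the spider finishes.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    class ProductItem(Item):
        name = Field()
        url = Field()

    class ProductsSpider(CrawlSpider):
        name = "products"
        allowed_domains = ["example.com"]     # too-narrow domains silently drop links
        start_urls = ["http://example.com/"]

        rules = (
            # follow=True keeps feeding category pages back into the queue
            Rule(SgmlLinkExtractor(allow=(r"/category/",)), follow=True),
            # product pages are bound to an explicit callback
            Rule(SgmlLinkExtractor(allow=(r"/product/",)), callback="parse_product"),
        )

        # Do not override parse() in a CrawlSpider -- it is what runs the rules.
        def parse_product(self, response):
            hxs = HtmlXPathSelector(response)
            item = ProductItem()
            item["name"] = hxs.select("//h1/text()").extract()
            item["url"] = response.url
            return item

If a pattern like allow=(r"/product/",) does not match the real URLs, nothing new gets enqueued and the spider closes with finish_reason 'finished', exactly as in your log.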
