Why does my Scrapy crawler stop?

Posted 2019-05-26 15:47

I have written a crawler using the Scrapy framework to parse a products site. The crawler stops abruptly partway through, without completing the full parsing process. I have researched this a lot, and most of the answers indicate that my crawler is being blocked by the website. Is there any mechanism by which I can detect whether my spider is being stopped by the website, or whether it stops on its own?

Below is the INFO-level log output from the spider:

2013-09-23 09:59:07+0000 [scrapy] INFO: Scrapy 0.18.0 started (bot: crawler)  
2013-09-23 09:59:08+0000 [spider] INFO: Spider opened  
2013-09-23 09:59:08+0000 [spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)  
2013-09-23 10:00:08+0000 [spider] INFO: Crawled 10 pages (at 10 pages/min), scraped 7 items (at 7 items/min)  
2013-09-23 10:01:08+0000 [spider] INFO: Crawled 22 pages (at 12 pages/min), scraped 19 items (at 12 items/min)  
2013-09-23 10:02:08+0000 [spider] INFO: Crawled 31 pages (at 9 pages/min), scraped 28 items (at 9 items/min)  
2013-09-23 10:03:08+0000 [spider] INFO: Crawled 40 pages (at 9 pages/min), scraped 37 items (at 9 items/min)  
2013-09-23 10:04:08+0000 [spider] INFO: Crawled 49 pages (at 9 pages/min), scraped 46 items (at 9 items/min)  
2013-09-23 10:05:08+0000 [spider] INFO: Crawled 59 pages (at 10 pages/min), scraped 56 items (at 10 items/min)  

Below is the last part of the DEBUG-level log before the spider is closed:

2013-09-25 11:33:24+0000 [spider] DEBUG: Crawled (200) <GET http://url.html> (referer: http://site_name)
2013-09-25 11:33:24+0000 [spider] DEBUG: Scraped from <200 http://url.html>

// scraped data in JSON form

2013-09-25 11:33:25+0000 [spider] INFO: Closing spider (finished)  
2013-09-25 11:33:25+0000 [spider] INFO: Dumping Scrapy stats:  
    {'downloader/request_bytes': 36754,  
     'downloader/request_count': 103,  
     'downloader/request_method_count/GET': 103,  
     'downloader/response_bytes': 390792,  
     'downloader/response_count': 103,  
     'downloader/response_status_count/200': 102,  
     'downloader/response_status_count/302': 1,  
     'finish_reason': 'finished',  
     'finish_time': datetime.datetime(2013, 9, 25, 11, 33, 25, 1359),  
     'item_scraped_count': 99,  
     'log_count/DEBUG': 310,  
     'log_count/INFO': 14,  
     'request_depth_max': 1,  
     'response_received_count': 102,  
     'scheduler/dequeued': 100,  
     'scheduler/dequeued/disk': 100,  
     'scheduler/enqueued': 100,  
     'scheduler/enqueued/disk': 100,  
     'start_time': datetime.datetime(2013, 9, 25, 11, 23, 3, 869392)}  
2013-09-25 11:33:25+0000 [spider] INFO: Spider closed (finished)  

There are still pages remaining to be parsed, but the spider stops.
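
For reference, one way I could try to surface blocking (rather than guessing from silence) would be to let non-200 responses reach the callback and log them. This is only a rough sketch against the 0.18-style API, not my real spider, and the URL and names are placeholders:

    # Sketch: make possible blocks (403/429/503, redirects to a ban page)
    # visible in the log instead of being filtered out by HttpErrorMiddleware.
    from scrapy import log
    from scrapy.spider import BaseSpider

    class BlockCheckSpider(BaseSpider):
        name = "block_check"
        start_urls = ["http://site_name/"]                 # placeholder

        # Let these statuses reach parse() instead of being dropped.
        handle_httpstatus_list = [301, 302, 403, 429, 503]

        def parse(self, response):
            if response.status != 200:
                self.log("Possible block: %s returned %d"
                         % (response.url, response.status), level=log.WARNING)
                return
            # ... normal parsing would yield items and follow-up requests here ...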

Tags: scrapy
1 Answer
混吃等死
#2 · 2019-05-26 16:36

As far as I know, for a spider:

  1. There is a queue or pool of URLs to be scraped/parsed by parsing methods. You can bind a URL to a specific callback method, or let the default parse method do the job.
  2. From the parsing methods you must return/yield further request(s) to feed that pool, or item(s).
  3. When the pool runs out of URLs, or a stop signal is sent, the spider stops crawling.

It would be nice if you shared your spider code so we could check whether those bindings are correct; it's easy to miss some bindings by mistake when using SgmlLinkExtractor, for example. Note that your stats show finish_reason: 'finished' with scheduler/enqueued equal to scheduler/dequeued (100), which means the scheduler simply ran out of requests rather than the site forcibly shutting the spider down.
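
As an illustration only (a rough sketch with Scrapy 0.18-era imports; the item fields, URL patterns and names are placeholders, not your actual site), the bindings I mean look roughly like this, where the rules keep feeding the URL pool and the callback yields items:

    # Sketch of a rule-based spider: if the allow= patterns or follow= flags
    # are too narrow, the URL pool empties early and the spider finishes.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    class ProductItem(Item):
        name = Field()
        url = Field()

    class ProductsSpider(CrawlSpider):
        name = "products"
        allowed_domains = ["example.com"]     # too-narrow domains silently drop links
        start_urls = ["http://example.com/"]

        rules = (
            # follow=True keeps feeding category pages back into the queue
            Rule(SgmlLinkExtractor(allow=(r"/category/",)), follow=True),
            # product pages are bound to an explicit callback
            Rule(SgmlLinkExtractor(allow=(r"/product/",)), callback="parse_product"),
        )

        # Do not override parse() in a CrawlSpider -- it is what runs the rules.
        def parse_product(self, response):
            hxs = HtmlXPathSelector(response)
            item = ProductItem()
            item["name"] = hxs.select("//h1/text()").extract()
            item["url"] = response.url
            return item

If a pattern like allow=(r"/product/",) does not match the real URLs, nothing new gets enqueued and the spider closes with finish_reason 'finished', exactly as in your log.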
