How to stop scrapy spider after certain number of

Posted 2019-03-16 18:45

I am developing a simple scraper to get 9gag posts and their images, but I am unable to stop the scraper: it keeps on scraping, which I don't want. I want to increment a counter and stop after 100 posts. However, the 9gag page is designed so that each response returns only 10 posts, and after each response my counter starts over, so it only ever reaches 10. As a result, my loop runs indefinitely and never stops.


# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None
    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count +=1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

                    yield ninegag_item


                else:
                    break


        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse) 
        print count

Code for items.py is here

from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

So I want to increment a global count value. I tried doing this by passing 3 arguments to the parse function, but it gives this error:

TypeError: parse() takes exactly 3 arguments (2 given)

Is there a way to pass a global count value, return it after each iteration, and stop after, say, 100 posts?

The entire project is available here: Github. Even if I set POST_LIMIT=100, the infinite loop still happens. Here is the command I executed:

scrapy crawl first -s POST_LIMIT=10 --output=output.json

5 answers
小情绪 Triste *
#2 · 2019-03-16 18:59

count is local to the parse() method, so it is not preserved between pages. Change all occurrences of count to self.count to make it an instance variable of the class, and it will persist between pages.
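
The difference can be seen without Scrapy at all. The following is a minimal sketch in plain Python; the class names `LocalCounter` and `InstanceCounter` and the 10-element `page` list are stand-ins invented for illustration, with each call to `parse()` playing the role of one response.

```python
class LocalCounter:
    """Mimics the question's spider: the counter is re-created on every call."""

    def parse(self, posts):
        count = 0  # local variable: starts from 0 each time parse() runs
        for _ in posts:
            count += 1
        return count


class InstanceCounter:
    """Mimics the suggested fix: the counter lives on the instance."""

    def __init__(self):
        self.count = 0  # created once, shared by every parse() call

    def parse(self, posts):
        for _ in posts:
            self.count += 1
        return self.count


page = list(range(10))  # stand-in for the 10 posts in one response

local = LocalCounter()
local_totals = [local.parse(page) for _ in range(3)]

shared = InstanceCounter()
shared_totals = [shared.parse(page) for _ in range(3)]

print(local_totals)   # [10, 10, 10]
print(shared_totals)  # [10, 20, 30]
```

With the local variable the total never exceeds one page's worth of posts, which is why a check against 100 never triggers.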

Summer. ? 凉城
#3 · 2019-03-16 19:07

One can use custom_settings with CLOSESPIDER_PAGECOUNT as shown below.

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )
    last_gag_id = None

    COUNT_MAX = 30

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': COUNT_MAX
    }

    def parse(self, response):

        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            yield ninegag_item

        # Request the next page once per response, after all articles are processed.
        next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
迷人小祖宗
#4 · 2019-03-16 19:10

First, use self.count and initialize it outside of parse. Then, instead of preventing the items from being parsed, stop generating new requests once the limit is reached. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0

    def parse(self, response):

        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        if (self.count < self.COUNT_MAX):
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
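
To see how this stopping rule behaves, here is a rough pure-Python simulation of the logic above rather than Scrapy itself. The function name `simulate_crawl` is made up for illustration, and the default of 10 posts per page is taken from the question.

```python
def simulate_crawl(count_max, posts_per_page=10):
    """Return (pages_fetched, items_yielded) for the stop-requesting strategy."""
    count = 0
    pages = 0
    keep_going = True
    while keep_going:
        pages += 1                      # one response handled by parse()
        for _ in range(posts_per_page):
            count += 1                  # every post on the page is still yielded
        keep_going = count < count_max  # same test that guards the next Request
    return pages, count


print(simulate_crawl(30))  # (3, 30)
print(simulate_crawl(25))  # (3, 30): the last page is parsed in full
```

Because the check runs only after a whole page has been processed, the number of items can overshoot the limit by up to one page minus one post.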
(This account has been banned)
#5 · 2019-03-16 19:15

There's a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in settings: scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

One small caveat is that if you've enabled caching, it will count cache hits as page counts as well.
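
Note that CLOSESPIDER_PAGECOUNT counts responses, not posts, so a post limit has to be converted into a page count first. A small sketch of that arithmetic, assuming the 10 posts per response described in the question; POST_LIMIT and POSTS_PER_PAGE are illustrative names, not Scrapy settings, while CLOSESPIDER_ITEMCOUNT is Scrapy's related setting that counts scraped items instead of pages.

```python
import math

POST_LIMIT = 100      # posts wanted, as in the question (illustrative name)
POSTS_PER_PAGE = 10   # posts per response, as described in the question

# Page-based limit: round up so a partial last page is still fetched.
CLOSESPIDER_PAGECOUNT = math.ceil(POST_LIMIT / POSTS_PER_PAGE)

# Item-based alternative: close the spider after this many scraped items.
CLOSESPIDER_ITEMCOUNT = POST_LIMIT

print(CLOSESPIDER_PAGECOUNT)  # 10
```

Either value could then go into settings.py or be passed with -s. Requests that are already in progress when the limit is hit may still complete, so the final count can end up slightly above the configured number.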

姐就是有狂的资本
#6 · 2019-03-16 19:24

Spider arguments are passed through the crawl command using the -a option (see link).
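
Scrapy hands each -a key=value pair to the spider's constructor as a keyword argument, and the values arrive as strings. Below is a minimal sketch of handling such an argument. The mixin `PostLimitArgs`, the argument name `post_limit`, and the helper `limit_reached` are all hypothetical; the class is written without a Scrapy import so that it runs on its own.

```python
class PostLimitArgs:
    """Hypothetical mixin: in a real project, combine it with scrapy.Spider,
    e.g. `class FirstSpider(PostLimitArgs, scrapy.Spider)`."""

    def __init__(self, post_limit="100", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Values given with -a arrive as strings, so convert explicitly.
        self.post_limit = int(post_limit)
        self.count = 0

    def limit_reached(self):
        # True once enough posts have been counted to stop requesting pages.
        return self.count >= self.post_limit


# Stand-in for what `-a post_limit=25` would pass to the constructor.
spider = PostLimitArgs(post_limit="25")
spider.count = 25
print(spider.limit_reached())  # True
```

Under these assumptions the crawl would be started with something like: scrapy crawl first -a post_limit=25 --output=output.json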
