I am developing a simple scraper to get 9gag posts and their images, but due to some technical difficulties I am unable to stop the scraper, and it keeps on scraping, which I don't want. I want to increase the counter value and stop after 100 posts. But the 9gag page is designed so that each response gives only 10 posts, and after each iteration my counter value resets to 10; because of this my loop runs infinitely and never stops.
# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if count != 100:
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count
The code for items.py is here:
from scrapy.item import Item, Field

class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()
So I want to increase a global count value, and I tried this by passing 3 arguments to the parse function, but it gives an error:
TypeError: parse() takes exactly 3 arguments (2 given)
So is there a way to pass a global count value, return it after each iteration, and stop after 100 posts (say)?
The entire project is available here on GitHub. Even if I set POST_LIMIT=100 the infinite loop happens; here is the command I executed:
scrapy crawl first -s POST_LIMIT=10 --output=output.json
count is local to the parse() method, so it's not preserved between pages. Change all occurrences of count to self.count to make it an instance variable of the class, and it will persist between pages.
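A minimal sketch of that change, assuming the spider class from the question:

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    count = 0  # class attribute: self.count persists across parse() calls

    def parse(self, response):
        for article in response.xpath('//article'):
            self.count += 1          # was: count += 1
            if self.count > 100:     # stop once 100 posts have been seen
                return
            # build and yield the GagItem as in the question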
Alternatively, one can use custom_settings with CLOSESPIDER_PAGECOUNT, as shown below.
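A sketch of that approach, applied to the spider from the question (CLOSESPIDER_PAGECOUNT counts downloaded responses, so 10 pages of 10 posts is roughly the 100-post target):

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/',)

    custom_settings = {
        # Close the spider after 10 downloaded responses;
        # at 10 posts per response that is about 100 posts.
        'CLOSESPIDER_PAGECOUNT': 10,
    }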
First: use self.count and initialize it outside of parse(). Then, instead of preventing the parsing of the items, stop generating new requests once the limit is reached.
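A sketch along those lines, adapted from the spider in the question:

# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    count = 0  # lives on the instance, so it survives between pages

    def parse(self, response):
        last_gag_id = None
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            if gag_id:
                last_gag_id = gag_id[0]
                self.count += 1
                ninegag_item = GagItem()
                ninegag_item['entry_id'] = gag_id[0]
                ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                yield ninegag_item

        # Parse every item on the page, but only follow the next page
        # while fewer than 100 posts have been collected.
        if self.count < 100 and last_gag_id is not None:
            next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)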
There's a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in settings:
scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100
One small caveat is that if you've enabled caching, it will count cache hits as page counts as well.
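Or, equivalently, set it in the project's settings.py:

# settings.py
CLOSESPIDER_PAGECOUNT = 100  # handled by the built-in CloseSpider extension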
Spider arguments are passed through the crawl command using the -a option; check the link.
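A sketch of that approach; the limit argument name here is hypothetical:

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"

    def __init__(self, limit=100, *args, **kwargs):
        super(FirstSpider, self).__init__(*args, **kwargs)
        # invoked as: scrapy crawl first -a limit=100
        self.limit = int(limit)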