I want to use Scrapy on all pages that have this syntax:
mywebsite/?page=INTEGER
I tried this:
start_urls = ['MyWebsite']
rules = [Rule(SgmlLinkExtractor(allow=['/\?page=\d+']), 'parse')]
but it seems that the spider still only visits MyWebsite itself. What should I do to make it understand that I want it to follow the /?page=NumberOfPage links?
Edit:
I mean that I want to scrape these pages:
mywebsite/?page=1
mywebsite/?page=2
mywebsite/?page=3
mywebsite/?page=4
mywebsite/?page=5
..
..
..
mywebsite/?page=7677654
My code:
start_urls = [
    'http://example.com/?page=%s' % page for page in xrange(1, 100000)
]
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('my xpath')
    for site in sites:
        # extract the date difference for this entry
        DateDifference = site.xpath('xpath for date difference').extract()[0]
        # keep only entries added in the last 7 days
        if DateDifference.days < 8:
            yield Request(Link, meta={'date': Date}, callback=self.crawl)
I want to get all the data from pages that were added in the last 7 days. I don't know how many pages were added in that time, so I thought I could crawl a large number of pages, say 100000, and then check the date difference: if it is less than 7 days I want to yield, and if not I want to stop crawling entirely.
If I get it right, you want to crawl all pages that are younger than 7 days. One way to do it is to follow each page in order (assuming page n°1 is the youngest, n°2 is older than n°1, n°3 older than n°2...).
You can do something like this:
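Here is a minimal sketch of that sequential approach. It keeps the placeholders from your question (my xpath, xpath for date difference, Link, Date, self.crawl) and assumes http://example.com as the base URL; adapt those to your site.

# A sketch of the sequential approach, assuming Scrapy 0.24-era imports
# to match the question's code. 'my xpath', 'xpath for date difference',
# Link, Date and self.crawl are placeholders carried over from the
# question; http://example.com is an assumed base URL.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector


class MySpider(Spider):
    name = 'mywebsite'
    start_urls = ['http://example.com/?page=1']

    def parse(self, response):
        # First call: 'index' is not in meta yet, so we are on page 1.
        index = response.meta.get('index', 1)
        sel = Selector(response)
        sites = sel.xpath('my xpath')

        stop = False
        for site in sites:
            DateDifference = site.xpath('xpath for date difference').extract()[0]
            if DateDifference.days < 8:  # placeholder check from the question
                yield Request(Link, meta={'date': Date}, callback=self.crawl)
            else:
                # This entry is older than 7 days; later pages are older still.
                stop = True

        if not stop:
            # Chain the next page so parse() runs once per page, in order.
            yield Request('http://example.com/?page=%d' % (index + 1),
                          meta={'index': index + 1},
                          callback=self.parse)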
The idea is to execute parse sequentially. If this is the first time you enter the function, response.meta['index'] isn't defined: the index is 1. If this is a call after we already parsed another page, response.meta['index'] is defined: the index indicates the number of the page currently being scraped.

CrawlSpider with rules will not help in this case. Rules are used to extract links from the first page that match your patterns. Obviously your start url page doesn't have links to all those pages, which is why you don't get them. Something like this should work:
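For instance, a minimal sketch along those lines, again keeping the placeholders from the question and assuming http://example.com as the base URL. It uses a plain Spider instead of CrawlSpider, generates the ?page=N requests itself, and raises CloseSpider (from scrapy.exceptions) to abort the whole crawl once old entries show up:

# A sketch without CrawlSpider rules: generate the ?page=N URLs yourself
# and raise CloseSpider to abort once old entries appear. The URL, the
# xpaths, Link, Date and self.crawl are placeholders as above.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider


class PageSpider(Spider):
    name = 'pages'

    def start_requests(self):
        # The start page has no links matching /?page=\d+, so build the
        # requests directly instead of extracting them with a Rule.
        for page in xrange(1, 100000):
            yield Request('http://example.com/?page=%d' % page)

    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath('my xpath'):
            DateDifference = site.xpath('xpath for date difference').extract()[0]
            if DateDifference.days < 8:  # placeholder check from the question
                yield Request(Link, meta={'date': Date}, callback=self.crawl)
            else:
                # Stop the whole spider; remaining pages are older still.
                raise CloseSpider('entries older than 7 days reached')

Keep in mind that with concurrent requests the pages are not guaranteed to be downloaded in order, so CloseSpider may fire after a few extra pages have already been fetched; the sequential meta['index'] version above avoids that by scheduling one page at a time.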