Scrapy: crawl all pages that have this URL syntax

Posted 2019-04-02 17:05

I want to use Scrapy on all pages that have this syntax:

mywebsite/?page=INTEGER

I tried this:

start_urls = ['MyWebsite']
rules = [Rule(SgmlLinkExtractor(allow=['/\?page=\d+']), 'parse')]

but it seems the spider still only visits MyWebsite. What should I do to make it understand that I want to add /?page=NumberOfPage?

Edit

I mean that I want to scrape these pages:

mywebsite/?page=1
mywebsite/?page=2
mywebsite/?page=3
mywebsite/?page=4
mywebsite/?page=5
..
..
..
mywebsite/?page=7677654

My code:

start_urls = [
    'http://example.com/?page=%s' % page for page in xrange(1, 100000)
]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('my xpath')
    for site in sites:
        # note: extract()[0] returns a string, not a timedelta, so
        # .days below will fail until the date text is actually parsed
        DateDifference = site.xpath('xpath for date difference').extract()[0]
        if DateDifference.days < 8:
            yield Request(Link, meta={'date': Date}, callback=self.crawl)

I want to get all the data from pages that were added in the last 7 days. I don't know how many pages were added in that time, so I thought I could crawl a large number of pages, say 100000, and check the date difference on each: if it is less than 7 days I want to yield the request, otherwise I want to stop crawling entirely.
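The stopping condition hinges on turning the scraped date text into an actual day count. A minimal sketch of that step, assuming the page exposes an absolute date string (the format string and function name here are illustrative, not from the question):

```python
from datetime import datetime

def days_since(date_text, fmt="%Y-%m-%d", now=None):
    """Return how many days ago date_text was, relative to now."""
    posted = datetime.strptime(date_text, fmt)
    now = now or datetime.now()
    return (now - posted).days

# A post from 3 days ago is inside the 7-day window;
# one from 13 days ago is not.
```

With a helper like this, the spider's check becomes `if days_since(date_text) < 8: yield Request(...)` instead of calling `.days` on a raw string.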

2 answers
趁早两清
Answered 2019-04-02 17:35

If I get it right, you want to crawl all pages that are younger than 7 days. One way to do it is to follow each page in order (assuming page n°1 is the youngest, n°2 is older than n°1, n°3 older than n°2...).

You can do something like

start_urls = ['mywebsite/?page=1']

def parse(self, response):
    sel = Selector(response)
    DateDifference = sel.xpath('xpath for date difference').extract()[0]

    i = response.meta['index'] if 'index' in response.meta else 1

    if DateDifference.days < 8:
        yield Request(Link, meta={'date': Date}, callback=self.crawl)
        i += 1
        yield Request('mywebsite/?page='+str(i), meta={'index':i}, callback=self.parse)

The idea is to execute parse sequentially. The first time you enter the function, response.meta['index'] isn't defined, so the index is 1. On later calls, after another page has already been parsed, response.meta['index'] is defined and indicates the number of the page currently being scraped.
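The stop-when-stale chain can be modeled outside Scrapy to see the control flow. This is only a sketch: `get_age_days` is a stand-in for the real date-difference XPath, and collecting page numbers stands in for yielding Requests:

```python
def crawl_until_stale(get_age_days, max_pages=100000, max_age_days=7):
    """Walk pages 1, 2, 3, ... and collect page numbers until a page
    is older than max_age_days, mirroring the sequential Request chain."""
    fresh = []
    for page in range(1, max_pages + 1):
        if get_age_days(page) > max_age_days:
            break  # equivalent to not yielding the next page's Request
        fresh.append(page)
    return fresh

# If pages age one day per page number, pages 1-7 are fresh
# and the walk stops at page 8.
```

The key property, same as in the spider above: no page beyond the first stale one is ever fetched, unlike the 100000-URL start_urls approach.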

我想做一个坏孩纸
Answered 2019-04-02 17:49

CrawlSpider with rules will not help in this case. Rules are used to extract links from the first page that match your patterns. Obviously your start URL page doesn't have links to all those pages, which is why you don't get them.

Something like this should work:

class MyWebsiteSpider(Spider):
    ...

    def start_requests(self):
        for i in xrange(1, 7677654 + 1):
            yield self.make_requests_from_url('mywebsite/?page=%d' % i)
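One advantage of start_requests over building a huge start_urls list is that the URLs can be generated lazily, one at a time. A small illustration of the same pattern (the base URL is just the question's placeholder):

```python
def page_urls(base="http://mywebsite/?page=%d", first=1, last=5):
    """Yield page URLs one by one, as start_requests does, instead of
    materialising thousands of strings in a start_urls list up front."""
    for i in range(first, last + 1):
        yield base % i
```

Each URL exists only while its request is being created, which matters when the page count is in the millions.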