How can I crawl a two-level website with Scrapy

Posted 2019-06-13 04:41

I want to crawl a website that has two levels of URLs. The first level is a multi-page list, with URLs like this:

Page layout like this:

  • List item link 1
  • List item link 2
  • List item link 3
  • List item link 4

1,2,3,4,5 ... nextpage

and the second level is a detail page, with URLs like this:

My spider code is:

import scrapy
from scrapy.spiders.crawl import CrawlSpider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders.crawl import Rule
from urlparse import urljoin

class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"

    start_urls=[
        "http://www.example.com/group/"
    ]

    rules = [
        Rule(LxmlLinkExtractor(allow=(),
                               restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])),
                               callback='parse_list_page',
                               follow=True)
    ]

    def parse_list_page(self, response):

        list_page=response.xpath("//div[@class='li-itemmod']/div/h3/a/@href").extract()

        for item in list_page:
            yield scrapy.http.Request(self,url=urljoin(response.url,item),callback=self.parse_detail_page)


    def parse_detail_page(self,response):

        community_name=response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()

        self.log(community_name,2)  

My question is: my parse_detail_page never seems to run. Can somebody tell me why, and how can I fix it?

thanks!

Tags: python scrapy
2 Answers
chillily · 2019-06-13 04:53

If I'm understanding your question correctly, what you are looking for here is request chaining. Request chaining is when you carry the data gathered from response1 over to response2 via the request's meta:

from scrapy import Request  # module-level import, needed for the requests below

def parse(self, response):
    item = dict()
    item['name'] = response.xpath("...").extract_first()

    urls = response.xpath("//a/@href").extract()
    for url in urls:
        yield Request(url, self.parse2,
                      meta={'item': item})  # <-- this is the important bit

def parse2(self, response):
    # now let's retrieve our item generated in parse()
    item = response.meta['item']
    item['last_name'] = response.xpath("...").extract_first()
    return item
    # {'name': 'some_name', 'last_name': 'some_last_name'}
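
Applied to your spider, a rough sketch could look like the following. It reuses the selectors from your question and the `import scrapy` you already have, and carries the list-page URL over to the detail callback; I haven't run it against the actual site:

def parse_list_page(self, response):
    # one Request per detail link; response.urljoin resolves relative hrefs
    for href in response.xpath("//div[@class='li-itemmod']/div/h3/a/@href").extract():
        yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_detail_page,
                             meta={'list_url': response.url})  # carry list-page data along

def parse_detail_page(self, response):
    # retrieve the data carried over from the list page
    community_name = response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()
    yield {'list_url': response.meta['list_url'],
           'community_name': community_name}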
我命由我不由天 · 2019-06-13 04:59

You should never override the parse method of a CrawlSpider, because it contains the core parsing logic for this type of spider. Your def parse( should be def parse_list_page( - and that naming is your issue.

However, your rule looks like overhead: it uses both a callback and follow=True just to extract links. It is better to use a list of rules and rewrite your spider like this:

class MyCrawler(CrawlSpider):  # imports are the same as in your original spider
    name = "AnjukeCrawler"

    start_urls = [
        "http://www.example.com/group/"
    ]

    rules = [
        # first rule: follow the pagination links, no callback needed
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='multi-page']/a[@class='aNxt']"),
             follow=True),
        # second rule: follow the detail links and parse them;
        # restrict_xpaths should point at the <a> elements, not @href
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='li-itemmod']/div/h3/a"),
             callback='parse_detail_page'),
    ]

    def parse_detail_page(self, response):
        community_name = response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()
        self.log(community_name, 2)
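
As a side note beyond the original answer: if you also want the scraped value in the crawl output rather than only in the log, parse_detail_page could yield a dict instead. A minimal sketch, where the /text() step is an assumption about the page markup:

def parse_detail_page(self, response):
    # yield a dict item so `scrapy crawl AnjukeCrawler -o out.json` writes it to the feed
    yield {
        'community_name': response.xpath(
            "//dl[@class='comm-l-detail float-l']/dd/text()").extract_first(),
    }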

BTW, there are too many brackets in your link extractor: restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])
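
restrict_xpaths accepts a plain string (or a list of strings), so this is enough:

restrict_xpaths="//div[@class='multi-page']/a[@class='aNxt']"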
