I want to crawl a website which has two levels of URLs. The first level is a multi-page list, with a page layout like this:
- List item link 1
- List item link 2
- List item link 3
- List item link 4
1,2,3,4,5 ... nextpage
and the second level is a detail page. My spider code is:
import scrapy
from scrapy.spiders.crawl import CrawlSpider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders.crawl import Rule
from urlparse import urljoin

class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"

    start_urls = [
        "http://www.example.com/group/"
    ]

    rules = [
        Rule(LxmlLinkExtractor(allow=(),
                               restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])),
             callback='parse_list_page',
             follow=True)
    ]
    def parse(self, response):
        list_page = response.xpath("//div[@class='li-itemmod']/div/h3/a/@href").extract()
        for item in list_page:
            yield scrapy.http.Request(self, url=urljoin(response.url, item),
                                      callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        community_name = response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()
        self.log(community_name, 2)
My question is: my parse_detail_page never seems to run. Can somebody tell me why, and how can I fix it?
Thanks!
If I'm understanding your question, then what you are looking for here is request chaining. Request chaining is when you carry the data gathered from response1 over to response2 via the request:
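Roughly like this (a sketch only: the spider name and the item fields are illustrative, the XPaths are taken from your question):

import scrapy
from urlparse import urljoin

class ChainingSpider(scrapy.Spider):
    name = "chaining_example"
    start_urls = ["http://www.example.com/group/"]

    def parse(self, response):
        # First-level list page: grab some data from each row, then
        # carry it over to the detail request through meta.
        for row in response.xpath("//div[@class='li-itemmod']"):
            href = row.xpath("./div/h3/a/@href").extract_first()
            if not href:
                continue
            item = {"list_title": row.xpath("./div/h3/a/text()").extract_first()}
            yield scrapy.Request(urljoin(response.url, href),
                                 callback=self.parse_detail_page,
                                 meta={"item": item})

    def parse_detail_page(self, response):
        # Second-level detail page: pick the carried item back up and finish it.
        item = response.meta["item"]
        item["community_name"] = response.xpath(
            "//dl[@class='comm-l-detail float-l']/dd/text()").extract_first()
        yield item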
You should never override the parse method of a CrawlSpider, because it contains the core parsing logic for this type of spider; your def parse( should therefore be def parse_list_page( - and that typo is your issue. However, your rule looks like overhead, because it uses both a callback and follow=True just to extract links; it is better to use a list of rules and rewrite your spider like this:
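Something along these lines (a sketch: one rule only follows the pagination links, the other sends every list-item link to your callback; the XPaths are taken from your question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LxmlLinkExtractor

class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"
    start_urls = ["http://www.example.com/group/"]

    rules = [
        # Pagination: only follow the "next page" links, no callback needed.
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='multi-page']/a[@class='aNxt']"),
             follow=True),
        # List items: send every detail-page link to the callback.
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='li-itemmod']/div/h3/a"),
             callback="parse_detail_page"),
    ]

    def parse_detail_page(self, response):
        community_name = response.xpath(
            "//dl[@class='comm-l-detail float-l']/dd")[0].extract()
        self.log(community_name)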
restrict_xpaths=(["//div[@class='multi- page']/a[@class='aNxt']"])