I'm analyzing an HTML page which has a two-level menu. When the top-level menu changes, an AJAX request is sent to fetch the second-level menu items; when both the top-level and second-level menus are selected, the content is refreshed.
What I need is to send another request and get the submenu response inside Scrapy's parse function, so that I can iterate over the submenu and build a scrapy.Request per submenu item.
The pseudocode looks like this:
def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH')
    second_level_menu_items = ...  # HERE I NEED TO SEND A REQUEST AND GET THE RESULT, PARSED INTO AN ITEM VALUE LIST
    for second_menu_item in second_level_menu_items:
        yield scrapy.Request(
            response.urljoin(
                content_request_url
                + '?top_level=' + top_level_menu
                + '&second_level_menu=' + second_menu_item
            ),
            callback=self.parse_content,
        )
How can I do this? By using the requests library directly, or with some other feature provided by Scrapy?
The recommended approach here is to create another callback (parse_second_level_menus, say) to handle the response for the second-level menu items, and in there create the requests to the content pages.
You can also use the request.meta attribute to pass data between callback methods (see the Scrapy documentation on Request.meta for more info).
It could be something along these lines:
def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()

    yield scrapy.Request(
        some_url,
        callback=self.parse_second_level_menus,
        # pass the top_level_menu value to the other callback
        meta={'top_menu': top_level_menu},
    )

def parse_second_level_menus(self, response):
    # read the data passed in the meta by the first callback
    top_level_menu = response.meta.get('top_menu')

    second_level_menu_items = response.xpath('...').getall()
    for second_menu_item in second_level_menu_items:
        url = response.urljoin(
            content_request_url
            + '?top_level=' + top_level_menu
            + '&second_level_menu=' + second_menu_item
        )
        yield scrapy.Request(url, callback=self.parse_content)

def parse_content(self, response):
    ...
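A side note: if you're on Scrapy 1.7 or newer, cb_kwargs is what the Scrapy docs now recommend for passing data to callbacks, instead of meta. A minimal sketch of the same idea (some_url is still a placeholder):

def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()
    yield scrapy.Request(
        some_url,
        callback=self.parse_second_level_menus,
        # delivered to the callback as a keyword argument
        cb_kwargs={'top_menu': top_level_menu},
    )

def parse_second_level_menus(self, response, top_menu):
    # top_menu arrives as a plain parameter, no meta lookup needed
    ...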
Yet another approach (less recommended in this case) would be to use this library: https://github.com/rmax/scrapy-inline-requests
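With that library, the intermediate response can be awaited inline inside parse itself, which matches the shape of the original pseudocode. A minimal sketch following the library's README pattern (some_url, the XPaths, and content_request_url are placeholders carried over from above; as I understand the library, every request yielded inside the decorated method is awaited inline rather than scheduled with its own callback, so the content pages are fetched and parsed there too):

import scrapy
from inline_requests import inline_requests

class MenuSpider(scrapy.Spider):
    name = 'menu'

    @inline_requests
    def parse(self, response):
        top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()
        # yield the request and receive its response right here,
        # instead of in a separate callback
        submenu_response = yield scrapy.Request(some_url)
        for second_menu_item in submenu_response.xpath('...').getall():
            content_response = yield scrapy.Request(
                response.urljoin(
                    content_request_url
                    + '?top_level=' + top_level_menu
                    + '&second_level_menu=' + second_menu_item
                )
            )
            # parse the content page inline and emit an item
            yield {'content': content_response.text}

Note that this serializes the content-page fetches instead of letting Scrapy schedule them concurrently, which is part of why the callback approach above is preferred here.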
Simply use dont_filter=True for your Request. Example:
def start_requests(self):
    return [Request(url=self.base_url, callback=self.parse_city)]

def parse_city(self, response):
    for next_page in response.css('a.category'):
        url = self.base_url + next_page.attrib['href']
        self.log(url)
        yield Request(url=url, callback=self.parse_something_else, dont_filter=True)

def parse_something_else(self, response):
    for next_page in response.css('#contentwrapper > div > div > div.component > table > tbody > tr:nth-child(2) > td > form > table > tbody > tr'):
        url = self.base_url + next_page.attrib['href']
        self.log(url)
        yield Request(url=url, callback=self.parse, dont_filter=True)

def parse(self, response):
    pass
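Why this helps: Scrapy's scheduler deduplicates requests by default, so a request whose URL has already been seen is silently dropped. dont_filter=True bypasses that duplicate filter, which is what you need when the menu navigation legitimately revisits the same URL. Use it with care, though, since it also removes the built-in protection against crawling loops.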