I'm analyzing an HTML page which has a two-level menu. When the top-level menu changes, an AJAX request is sent to fetch the second-level menu items; when both the top-level and second-level menus are selected, the content is refreshed.
What I need is to send another request and get the submenu response inside Scrapy's parse function, so that I can iterate over the submenu and build a scrapy.Request per submenu item.
The pseudocode looks like this:
def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH')
    second_level_menu_items = ...  # HERE I NEED TO SEND A REQUEST AND GET THE RESULT, PARSED INTO AN ITEM VALUE LIST
    for second_menu_item in second_level_menu_items:
        yield scrapy.Request(
            response.urljoin(
                content_request_url
                + '?top_level=' + top_level_menu
                + '&second_level_menu=' + second_menu_item
            ),
            callback=self.parse_content,
        )
How can I do this? By using the requests library directly, or with some other feature provided by Scrapy?
The recommended approach here is to create another callback (parse_second_level_menus, say) to handle the response for the second-level menu items, and in there create the requests to the content pages.
You can also use the request.meta attribute to pass data between callback methods (see the Scrapy documentation on Request.meta for more info).
It could be something along these lines:
def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()

    yield scrapy.Request(
        some_url,
        callback=self.parse_second_level_menus,
        # pass the top_level_menu value to the other callback
        meta={'top_menu': top_level_menu},
    )

def parse_second_level_menus(self, response):
    # read the data passed in the meta by the first callback
    top_level_menu = response.meta.get('top_menu')

    second_level_menu_items = response.xpath('...').getall()
    for second_menu_item in second_level_menu_items:
        url = response.urljoin(
            content_request_url
            + '?top_level=' + top_level_menu
            + '&second_level_menu=' + second_menu_item
        )
        yield scrapy.Request(url, callback=self.parse_content)

def parse_content(self, response):
    ...
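A side note: if you're on Scrapy 1.7 or newer, cb_kwargs is what the Scrapy docs now recommend for passing data to callbacks, instead of meta. A minimal sketch of the same idea (some_url is still a placeholder):

def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()
    yield scrapy.Request(
        some_url,
        callback=self.parse_second_level_menus,
        # delivered to the callback as a keyword argument
        cb_kwargs={'top_menu': top_level_menu},
    )

def parse_second_level_menus(self, response, top_menu):
    # top_menu arrives as a plain parameter, no meta lookup needed
    ...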
Yet another approach (less recommended in this case) would be to use this library: https://github.com/rmax/scrapy-inline-requests
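With that library, the intermediate response can be awaited inline inside parse itself, which matches the shape of the original pseudocode. A minimal sketch following the library's README pattern (some_url, the XPaths, and content_request_url are placeholders carried over from above; as I understand the library, every request yielded inside the decorated method is awaited inline rather than scheduled with its own callback, so the content pages are fetched and parsed there too):

import scrapy
from inline_requests import inline_requests

class MenuSpider(scrapy.Spider):
    name = 'menu'

    @inline_requests
    def parse(self, response):
        top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()
        # yield the request and receive its response right here,
        # instead of in a separate callback
        submenu_response = yield scrapy.Request(some_url)
        for second_menu_item in submenu_response.xpath('...').getall():
            content_response = yield scrapy.Request(
                response.urljoin(
                    content_request_url
                    + '?top_level=' + top_level_menu
                    + '&second_level_menu=' + second_menu_item
                )
            )
            # parse the content page inline and emit an item
            yield {'content': content_response.text}

Note that this serializes the content-page fetches instead of letting Scrapy schedule them concurrently, which is part of why the callback approach above is preferred here.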
Simply use dont_filter=True for your Request. Example:
def start_requests(self):
    return [Request(url=self.base_url, callback=self.parse_city)]

def parse_city(self, response):
    for next_page in response.css('a.category'):
        url = self.base_url + next_page.attrib['href']
        self.log(url)
        yield Request(url=url, callback=self.parse_something_else, dont_filter=True)

def parse_something_else(self, response):
    for next_page in response.css('#contentwrapper > div > div > div.component > table > tbody > tr:nth-child(2) > td > form > table > tbody > tr'):
        url = self.base_url + next_page.attrib['href']
        self.log(url)
        yield Request(url=url, callback=self.parse, dont_filter=True)

def parse(self, response):
    pass
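Why this helps: Scrapy's scheduler deduplicates requests by default, so a request whose URL has already been seen is silently dropped. dont_filter=True bypasses that duplicate filter, which is what you need when the menu navigation legitimately revisits the same URL. Use it with care, though, since it also removes the built-in protection against crawling loops.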