I'm crawling a page that loads its data with infinite scrolling. I'm using CrawlSpider, and the rules are defined like this:
rules = (
    Rule(LinkExtractor(restrict_xpaths=('//*some/xpaths')), callback='parse_first_items', follow=True),
    Rule(LinkExtractor(restrict_xpaths=('//*some/other/xpaths')), callback='parse_second_items'),
)
In the parse_first_items function, I have a Request that makes the AJAX requests:
def parse_first_items(self, response):
    l = ItemLoader(item=AmazonCnCustomerItem(), response=response)
    l.add_xpath('field1', '//p[@class="field1"]/text()')
    l.add_xpath('field2', '//p[@class="field2"]/text()')
    # URL of the AJAX endpoint that returns the next chunk of scrolled data
    r_url = l.get_xpath('//*/url/xpath/@href')[0]
    r = Request(url=r_url,
                headers={"Referer": "the/same/page/url",
                         "X-Requested-With": "XMLHttpRequest"},
                callback=self.parse_first_items)
    return [r, l.load_item()]
I get the desired data just fine, but the LinkExtractor in the second Rule does not pick up the URLs in the data returned by the Request issued inside the parse_first_items function.
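I imagine I could work around this by keeping a copy of the second rule's extractor and applying it by hand to the AJAX responses, roughly like the untested sketch below (second_extractor is just a name I made up, and the XPaths and callback are the placeholders from my rules above), but that seems to defeat the purpose of having the Rule in the first place:

# rough idea, inside the same spider class
second_extractor = LinkExtractor(restrict_xpaths=('//*some/other/xpaths'))

def parse_first_items(self, response):
    # ... yield the item and the AJAX Request exactly as above, then also
    # run the second rule's extractor over the AJAX-generated response by hand
    for link in self.second_extractor.extract_links(response):
        yield Request(link.url, callback=self.parse_second_items)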
How can I make the LinkExtractor in the second Rule extract those links itself and pass them to the parse_second_items callback?