I'm new to Scrapy-Splash and I'm trying to scrape a lazy DataTable, i.e. a table with AJAX pagination. I need to load the website, wait until the JS has executed, grab the HTML of the table, and then click the "Next" button of the pagination. My approach works, but I'm afraid it requests the website twice: the first time when I yield the SplashRequest, and then again when the lua_source script is executed. Is that true? If so, how can I make it perform the request only once?
import scrapy
from lxml import etree
from lxml.etree import HTMLParser
from scrapy_splash import SplashRequest


class JSSpider(scrapy.Spider):
    name = 'js_spider'

    # Load the page, grab the table, click "Next", wait, grab the table again.
    script = """
    function main(splash, args)
        splash:go(args.url)
        splash:wait(0.5)
        local page_one = splash:evaljs("$('#example').html()")
        splash:evaljs("$('#example_next').click()")
        splash:wait(2)
        local page_two = splash:evaljs("$('#example').html()")
        return {page_one=page_one, page_two=page_two}
    end
    """

    def start_requests(self):
        url = "https://datatables.net/examples/server_side/defer_loading.html"
        yield SplashRequest(url, endpoint='execute', callback=self.parse,
                            args={'wait': 0.5, 'lua_source': self.script, 'url': url})

    def parse(self, response):
        # assert isinstance(response, SplashTextResponse)
        page_one = response.data.get('page_one', None)
        page_one_root = etree.fromstring(page_one, HTMLParser())
        page_two = response.data.get('page_two', None)
        page_two_root = etree.fromstring(page_two, HTMLParser())
EDIT
Also, I would like a better way to wait until the AJAX request has completed than just splash:wait(2). Is it possible to somehow wait until the table has changed? Ideally with some timeout.
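To make it concrete, something along these lines is what I have in mind, a rough, untested sketch where the helper name, the timeout and the poll interval are just placeholders: a small Lua helper that re-reads the table HTML until it differs from a snapshot taken before the click, giving up after a timeout.

    local function wait_for_table_change(splash, before, timeout, poll_interval)
        -- Poll the table's HTML until it differs from the `before` snapshot,
        -- or give up once `timeout` seconds have elapsed.
        local elapsed = 0
        while elapsed < timeout do
            splash:wait(poll_interval)
            elapsed = elapsed + poll_interval
            local current = splash:evaljs("$('#example').html()")
            if current ~= before then
                return current
            end
        end
        return nil  -- timed out, the table never changed
    end

It would replace the fixed splash:wait(2) in main():

    splash:evaljs("$('#example_next').click()")
    local page_two = wait_for_table_change(splash, page_one, 10, 0.2)

If Splash has a built-in way to do this, I'd prefer that over hand-rolled polling.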