I am trying to scrape data from a page that uses a lot of AJAX calls and javascript execution to render the webpage, so I am trying to use scrapy with selenium to do this. The modus operandi is as follows:
1. Add the login page URL to the scrapy start_urls list.
2. Use the FormRequest.from_response method to post the username and password and get authenticated.
3. Once logged in, request the desired page to be scraped.
4. Pass this response to the Selenium Webdriver to click buttons on the page.
5. Once the buttons are clicked and a new webpage is rendered, capture the result.
The code that I have thus far is as follows:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request
from selenium import webdriver
import time

class LoginSpider(BaseSpider):
    name = "sel_spid"
    start_urls = ["http://www.example.com/login.aspx"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        return FormRequest.from_response(response,
                                         formdata={'User': 'username', 'Pass': 'password'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        if "Log Out" in response.body:
            self.log("Successfully logged in")
            scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
            yield Request(url=scrape_url, callback=self.parse_page)
        else:
            self.log("Bad credentials")

    def parse_page(self, response):
        self.driver.get(response.url)
        next_button = self.driver.find_element_by_class_name('dxWeb_pNext')
        next_button.click()
        time.sleep(2)
        # capture the html and store in a file
The 2 roadblocks I have hit so far are:

Step 4 does not work. Whenever Selenium opens the Firefox window, it is always at the login screen and does not know how to get past it.

I don't know how to achieve step 5.

Any help will be greatly appreciated.
Log in with the scrapy API first, then pass the session cookies on to the selenium Chrome driver, along the lines of the sketch below.
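A minimal sketch of that hand-off, assuming the site keeps its session in cookies; the cookie parsing here is simplified, and the URLs are the ones from the question:

from selenium import webdriver

# inside the spider from the question:
def check_login_response(self, response):
    if "Log Out" in response.body:
        # Collect the session cookies scrapy received while logging in.
        cookies = []
        for header in response.headers.getlist('Set-Cookie'):
            name, _, rest = header.decode('utf-8').partition('=')
            cookies.append({'name': name, 'value': rest.split(';', 1)[0]})

        # Selenium only accepts cookies for the domain currently loaded,
        # so open any page on the site first, then inject the cookies.
        driver = webdriver.Chrome()
        driver.get('http://www.example.com/')
        for cookie in cookies:
            driver.add_cookie(cookie)

        # The driver now shares the authenticated session.
        driver.get('http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500')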
I don't believe you can switch between scrapy Requests and selenium like that. You need to log into the site using selenium, not yield Request(). The login session you created with scrapy is not transferred to the selenium session. Here is an example sketch; it reuses the User/Pass field names from the question, and the element ids/xpath will be different for you:
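from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.example.com/login.aspx")

# The 'User'/'Pass' names are taken from the FormRequest in the question;
# inspect the real login form for the correct locators.
driver.find_element_by_name('User').send_keys('username')
driver.find_element_by_name('Pass').send_keys('password')
driver.find_element_by_name('Pass').submit()  # submits the enclosing form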
Then, still inside the same authenticated browser session, you can request the search page and click through it, roughly:
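import time
import io

driver.get('http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500')
driver.find_element_by_class_name('dxWeb_pNext').click()

# Give the javascript a moment to render the new page; an explicit
# WebDriverWait on an expected element is more reliable than sleeping.
time.sleep(2)

# page_source is the DOM after rendering - save or parse it from here.
with io.open('result.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)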
etc.
EDIT: If you need to render javascript and are worried about speed/non-blocking, you can use Splash (http://splash.readthedocs.org/en/latest/index.html), which should do the trick.
http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie has details on passing a cookie; you should be able to pass it from scrapy, but I have not done it before.
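Untested, but judging from that scripting reference, the scrapy side might look roughly like this; the Splash host/port, the ASP.NET_SessionId cookie name, and the domain are all assumptions:

import json
from scrapy.http import Request

# Lua script for Splash's /execute endpoint: set the session cookie,
# load the page, and return the rendered html.
LUA_SOURCE = """
function main(splash)
    splash:add_cookie{"ASP.NET_SessionId", splash.args.session_value,
                      "/", domain="www.example.com"}
    splash:go(splash.args.url)
    return splash:html()
end
"""

def splash_request(url, session_value, callback):
    # Assumes a Splash instance running on the default port 8050.
    return Request('http://localhost:8050/execute',
                   method='POST',
                   headers={'Content-Type': 'application/json'},
                   body=json.dumps({'lua_source': LUA_SOURCE,
                                    'url': url,
                                    'session_value': session_value}),
                   callback=callback)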