Scrapy with selenium for a webpage requiring authentication

Posted 2019-02-10 23:29

I am trying to scrape data from a page that makes a lot of AJAX calls and executes JavaScript to render itself, so I am trying to use scrapy together with selenium. The modus operandi is as follows:

  1. Add the login page URL to the scrapy start_urls list

  2. Use the FormRequest.from_response method to post the username and password and get authenticated.

  3. Once logged in, request the desired page to be scraped.
  4. Pass this response to the Selenium WebDriver to click buttons on the page.
  5. Once the buttons are clicked and a new page is rendered, capture the result.

The code that I have thus far is as follows:

    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from selenium import webdriver
    import time


    class LoginSpider(BaseSpider):
        name = "sel_spid"
        start_urls = ["http://www.example.com/login.aspx"]

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            return FormRequest.from_response(response,
                   formdata={'User': 'username', 'Pass': 'password'},
                   callback=self.check_login_response)

        def check_login_response(self, response):
            if "Log Out" in response.body:
                self.log("Successfully logged in")
                scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
                yield Request(url=scrape_url, callback=self.parse_page)
            else:
                self.log("Bad credentials")

        def parse_page(self, response):
            self.driver.get(response.url)
            next = self.driver.find_element_by_class_name('dxWeb_pNext')
            next.click()
            time.sleep(2)
            # capture the html and store in a file

The two roadblocks I have hit so far are:

  1. Step 4 does not work. Whenever selenium opens the Firefox window, it is always at the login screen, and I do not know how to get past it.

  2. I don't know how to achieve step 5.

Any help will be greatly appreciated.

2 Answers
三岁会撩人 · 2019-02-11 00:05

Log in with the scrapy API first:

    # call scrapy post request with browse_files as callback
    return FormRequest.from_response(
        response,
        # formxpath=formxpath,
        formdata=formdata,
        callback=self.browse_files
    )

Then pass the session to the Selenium Chrome driver:

# logged in previously with scrapy api
from selenium.webdriver.common.by import By  # needed for the By.XPATH lookup below

def browse_files(self, response):
    print "browse files for: %s" % (response.url)

    # response.headers        
    cookie_list2 = response.headers.getlist('Set-Cookie')
    print cookie_list2

    self.driver.get(response.url)
    self.driver.delete_all_cookies()

    # extract all the cookies
    for cookie2 in cookie_list2:
        cookies = map(lambda e: e.strip(), cookie2.split(";"))

        for cookie in cookies:
            splitted = cookie.split("=")
            if len(splitted) == 2:
                name = splitted[0]
                value = splitted[1]
                #for my particular usecase I needed only these values
                if name == 'csrftoken' or name == 'sessionid':
                    cookie_map = {"name": name, "value": value}
                else:
                    continue
            elif len(splitted) == 1:
                cookie_map = {"name": splitted[0], "value": ''}
            else:
                continue

            print "adding cookie"
            print cookie_map
            self.driver.add_cookie(cookie_map)

    self.driver.get(response.url)

    # check if we have successfully logged in
    files = self.wait_for_elements_to_be_present(By.XPATH, "//*[@id='files']", response)
    print files
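The wait_for_elements_to_be_present helper called above is not part of selenium and is not shown in the answer. A minimal sketch of such a helper, assuming it lives on the same spider class and relies on selenium's explicit waits (the 10-second timeout is an arbitrary choice), might look like this:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def wait_for_elements_to_be_present(self, by, selector, response, timeout=10):
        # block until at least one element matching (by, selector) is attached to the DOM,
        # then return the list of matching elements; the response argument is unused here
        # and only kept to mirror the call signature used above
        return WebDriverWait(self.driver, timeout).until(
            EC.presence_of_all_elements_located((by, selector))
        )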
混吃等死 · 2019-02-11 00:29

I don't believe you can switch between scrapy Requests and selenium like that. You need to log into the site using selenium, not by yielding a Request(): the login session you created with scrapy is not transferred to the selenium session. Here is an example (the element IDs/XPaths will be different for you):

    scrape_url = "http://www.example.com/authen_handler.aspx"
    self.driver.get(scrape_url)
    time.sleep(2)
    username = self.driver.find_element_by_id("User")
    password = self.driver.find_element_by_name("Pass")
    username.send_keys("your_username")
    password.send_keys("your_password")
    self.driver.find_element_by_xpath("//input[@name='commit']").click()

then you can do:

    time.sleep(2)
    self.driver.find_element_by_class_name('dxWeb_pNext').click()
    time.sleep(2)

etc.
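For the asker's step 5, the rendered HTML can be read from the driver once the click has finished and written to a file. A minimal sketch (the file name is arbitrary):

    time.sleep(2)  # crude wait for the AJAX content to finish rendering
    html = self.driver.page_source  # the DOM as rendered after the click
    with open("page_after_click.html", "w", encoding="utf-8") as f:  # encoding kwarg: Python 3 open()
        f.write(html)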

EDIT: If you need to render JavaScript and are worried about speed/non-blocking, you can use Splash (http://splash.readthedocs.org/en/latest/index.html), which should do the trick.

http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie has details on passing a cookie; you should be able to pass it from scrapy, but I have not done it before.
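A rough sketch of what that could look like with the scrapy-splash plugin, handing a session cookie to a small Lua script before the page is rendered. This assumes scrapy-splash is installed and configured; the cookie name, domain, and the session_value variable are placeholders, not values from the question:

    from scrapy_splash import SplashRequest

    LUA_SOURCE = """
    function main(splash, args)
        -- hand the session cookie over to Splash before loading the page
        splash:add_cookie{args.cookie_name, args.cookie_value, domain=args.domain}
        assert(splash:go(args.url))
        assert(splash:wait(2))
        return splash:html()
    end
    """

    yield SplashRequest(
        scrape_url,
        callback=self.parse_page,
        endpoint='execute',
        args={
            'lua_source': LUA_SOURCE,
            'cookie_name': 'ASP.NET_SessionId',  # assumption: the site's session cookie name
            'cookie_value': session_value,       # assumption: value taken from the scrapy response headers
            'domain': 'www.example.com',
        },
    )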
