Scrapy with selenium for a webpage requiring authentication

Posted 2019-02-10 23:29

I am trying to scrape data from a page that makes a lot of AJAX calls and executes JavaScript to render itself, so I am trying to use scrapy together with selenium. The modus operandi is as follows:

  1. Add the login page URL to the scrapy start_urls list

  2. Use the FormRequest.from_response method to post the username and password and get authenticated.

  3. Once logged in, request the desired page to be scraped.
  4. Pass this response to the Selenium WebDriver to click buttons on the page.
  5. Once the buttons are clicked and a new page is rendered, capture the result.

The code that I have thus far is as follows:

    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from selenium import webdriver
    import time


    class LoginSpider(BaseSpider):
        name = "sel_spid"
        start_urls = ["http://www.example.com/login.aspx"]

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            return FormRequest.from_response(response,
                   formdata={'User': 'username', 'Pass': 'password'},
                   callback=self.check_login_response)

        def check_login_response(self, response):
            if "Log Out" in response.body:
                self.log("Successfully logged in")
                scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
                yield Request(url=scrape_url, callback=self.parse_page)
            else:
                self.log("Bad credentials")

        def parse_page(self, response):
            self.driver.get(response.url)
            next = self.driver.find_element_by_class_name('dxWeb_pNext')
            next.click()
            time.sleep(2)
            # capture the html and store in a file

The two roadblocks I have hit so far are:

  1. Step 4 does not work. Whenever selenium opens the Firefox window, it is always at the login screen, and I do not know how to get past it.

  2. I don't know how to achieve step 5.

Any help will be greatly appreciated.

2 Answers
三岁会撩人 · 2019-02-11 00:05

Log in with the scrapy API first:

    # call scrapy post request with browse_files as callback
    return FormRequest.from_response(
        response,
        # formxpath=formxpath,
        formdata=formdata,
        callback=self.browse_files
    )

Then pass the session to the Selenium Chrome driver:

# logged in previously with scrapy api
from selenium.webdriver.common.by import By  # needed for the By.XPATH lookup below

def browse_files(self, response):
    print "browse files for: %s" % (response.url)

    # response.headers        
    cookie_list2 = response.headers.getlist('Set-Cookie')
    print cookie_list2

    self.driver.get(response.url)
    self.driver.delete_all_cookies()

    # extract all the cookies
    for cookie2 in cookie_list2:
        cookies = map(lambda e: e.strip(), cookie2.split(";"))

        for cookie in cookies:
            splitted = cookie.split("=")
            if len(splitted) == 2:
                name = splitted[0]
                value = splitted[1]
                #for my particular usecase I needed only these values
                if name == 'csrftoken' or name == 'sessionid':
                    cookie_map = {"name": name, "value": value}
                else:
                    continue
            elif len(splitted) == 1:
                cookie_map = {"name": splitted[0], "value": ''}
            else:
                continue

            print "adding cookie"
            print cookie_map
            self.driver.add_cookie(cookie_map)

    self.driver.get(response.url)

    # check if we have successfully logged in
    files = self.wait_for_elements_to_be_present(By.XPATH, "//*[@id='files']", response)
    print files
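The wait_for_elements_to_be_present helper called above is not part of selenium and is not shown in the answer. A minimal sketch of such a helper, assuming it lives on the same spider class and relies on selenium's explicit waits (the 10-second timeout is an arbitrary choice), might look like this:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def wait_for_elements_to_be_present(self, by, selector, response, timeout=10):
        # block until at least one element matching (by, selector) is attached to the DOM,
        # then return the list of matching elements; the response argument is unused here
        # and only kept to mirror the call signature used above
        return WebDriverWait(self.driver, timeout).until(
            EC.presence_of_all_elements_located((by, selector))
        )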
混吃等死 · 2019-02-11 00:29

I don't believe you can switch between scrapy Requests and selenium like that. You need to log into the site using selenium, not by yielding a Request(): the login session you created with scrapy is not transferred to the selenium session. Here is an example (the element IDs/XPaths will be different for you):

    scrape_url = "http://www.example.com/authen_handler.aspx"
    self.driver.get(scrape_url)
    time.sleep(2)
    username = self.driver.find_element_by_id("User")
    password = self.driver.find_element_by_name("Pass")
    username.send_keys("your_username")
    password.send_keys("your_password")
    self.driver.find_element_by_xpath("//input[@name='commit']").click()

then you can do:

    time.sleep(2)
    self.driver.find_element_by_class_name('dxWeb_pNext').click()
    time.sleep(2)

etc.
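For the asker's step 5, the rendered HTML can be read from the driver once the click has finished and written to a file. A minimal sketch (the file name is arbitrary):

    time.sleep(2)  # crude wait for the AJAX content to finish rendering
    html = self.driver.page_source  # the DOM as rendered after the click
    with open("page_after_click.html", "w", encoding="utf-8") as f:  # encoding kwarg: Python 3 open()
        f.write(html)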

EDIT: If you need to render JavaScript and are worried about speed/non-blocking, you can use Splash (http://splash.readthedocs.org/en/latest/index.html), which should do the trick.

http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie has details on passing a cookie; you should be able to pass it from scrapy, but I have not done it before.
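A rough sketch of what that could look like with the scrapy-splash plugin, handing a session cookie to a small Lua script before the page is rendered. This assumes scrapy-splash is installed and configured; the cookie name, domain, and the session_value variable are placeholders, not values from the question:

    from scrapy_splash import SplashRequest

    LUA_SOURCE = """
    function main(splash, args)
        -- hand the session cookie over to Splash before loading the page
        splash:add_cookie{args.cookie_name, args.cookie_value, domain=args.domain}
        assert(splash:go(args.url))
        assert(splash:wait(2))
        return splash:html()
    end
    """

    yield SplashRequest(
        scrape_url,
        callback=self.parse_page,
        endpoint='execute',
        args={
            'lua_source': LUA_SOURCE,
            'cookie_name': 'ASP.NET_SessionId',  # assumption: the site's session cookie name
            'cookie_value': session_value,       # assumption: value taken from the scrapy response headers
            'domain': 'www.example.com',
        },
    )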
