This is sort of a follow-up question to one I asked earlier.
I'm trying to scrape a webpage which I have to log in to reach first. After authentication, the page I need requires a little bit of JavaScript to run before you can view the content. What I've done is followed the instructions here to install Splash to try to render the JavaScript. However...

Before I switched to Splash, authentication with Scrapy's InitSpider was fine. I was getting through the login page and scraping the target page OK (except without the JavaScript working, obviously). But once I add the code to pass the requests through Splash, it looks like I'm no longer parsing the target page.

Spider below. The only difference between the Splash version (here) and the non-Splash version is the start_requests() function. Everything else is the same between the two.
import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
        "http://www.bridgebase.com/myhands/index.php"
    ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    # authentication
    def init_request(self):
        return scrapy.http.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.http.FormRequest.from_response(
            response,
            formdata={'username': 'USERNAME', 'password': 'PASSWORD'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
            return self.initialized()
        else:
            self.log("Login failed")
            print(response.body)

    # pipe the requests through splash so the JS renders
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    # what to do when a link is encountered
    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    # do nothing on new link for now
    def parse_item(self, response):
        pass

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
What's happening now is that test.html, the result of parse(), is simply the login page itself rather than the page I'm supposed to be redirected to after login. This shows in the log: ordinarily I would see the "Login successful" line from check_login_response(), but as you can see below it seems like I'm not even getting to that step. Is this because Scrapy is now putting the authentication requests through Splash too, and it's getting hung up there? If that's the case, is there any way to bypass Splash for the authentication part only?
2016-01-24 14:54:56 [scrapy] INFO: Spider opened
2016-01-24 14:54:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 14:54:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 14:55:02 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
2016-01-24 14:55:02 [scrapy] INFO: Closing spider (finished)
I'm pretty sure I'm not using Splash correctly. Can anyone point me to some documentation where I can figure out what's going on?
I don't think Splash alone would handle this particular case well.

Here is the working idea:

- use selenium and the PhantomJS headless browser to log into the website
- pass the browser cookies from PhantomJS into Scrapy

The code:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        driver = webdriver.PhantomJS()
        driver.get(self.login_page)

        driver.find_element_by_id("username").send_keys("user")
        driver.find_element_by_id("password").send_keys("password")
        driver.find_element_by_name("submit").click()
        driver.save_screenshot("test.png")

        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))

        cookies = driver.get_cookies()
        driver.close()

        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)

    def parse(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
        else:
            self.log("Login failed")
            print(response.body)
Prints "Login successful" and the HTML of the "hands" page.
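One caveat worth knowing (my assumption, not something this spider strictly needs): selenium's get_cookies() returns dicts carrying extra keys such as httpOnly, secure, and expiry, while Scrapy's cookies parameter only cares about name/value (and optionally domain/path). Trimming the dicts avoids relying on Scrapy ignoring the extras:

```python
# selenium cookie dicts look roughly like this (illustrative values):
selenium_cookies = [
    {'name': 'PHPSESSID', 'value': 'abc123', 'domain': '.bridgebase.com',
     'path': '/', 'httpOnly': True, 'secure': False},
]

# keep only the keys Scrapy's cookie handling uses
scrapy_cookies = [{'name': c['name'], 'value': c['value'],
                   'domain': c.get('domain', ''), 'path': c.get('path', '/')}
                  for c in selenium_cookies]

print(scrapy_cookies)
```

The trimmed list can then be passed as `cookies=scrapy_cookies` in the yielded Request.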
Update

So, it seems that start_requests fires before the login. Here is the code from InitSpider, minus comments:
class InitSpider(Spider):
    def start_requests(self):
        self._postinit_reqs = super(InitSpider, self).start_requests()
        return iterate_spider_output(self.init_request())

    def initialized(self, response=None):
        return self.__dict__.pop('_postinit_reqs')

    def init_request(self):
        return self.initialized()
InitSpider stashes the requests from the base class's start_requests and only releases them when initialized() is called. Your start_requests overrides that method entirely, so the stash is never set and the login never runs first. So maybe something like this will work:
from scrapy.utils.spider import iterate_spider_output

...

def start_requests(self):
    self._postinit_reqs = self.my_start_requests()
    return iterate_spider_output(self.init_request())

def my_start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        })
You still need to return self.initialized() from check_login_response() so that the stashed requests are released.
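To see why that return matters, here is a minimal stand-alone sketch of the stash-and-release mechanics (plain Python, no Scrapy; all names and strings are illustrative):

```python
class FakeInitSpider(object):
    """Mimics InitSpider's start_requests/initialized dance."""

    def my_start_requests(self):
        # stands in for the splash-wrapped requests
        yield "splash request for index.php"

    def init_request(self):
        # stands in for the login request
        return "login request"

    def start_requests(self):
        # stash the real requests; only the login goes out now
        self._postinit_reqs = self.my_start_requests()
        return [self.init_request()]

    def check_login_response(self):
        # releases the stashed generator -- without this return,
        # the splash requests are never scheduled
        return self.__dict__.pop('_postinit_reqs')


spider = FakeInitSpider()
print(list(spider.start_requests()))        # ['login request']
print(list(spider.check_login_response()))  # ['splash request for index.php']
```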
You can get all the data without needing JS at all. There are links available for browsers that do not have JavaScript enabled; the URLs are the same bar ?offset=0. You just need to parse the queries from the tourney URL you are interested in and create a FormRequest.
import scrapy
from scrapy.spiders.init import InitSpider
from urlparse import parse_qs, urlparse


class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
        "http://www.bridgebase.com/myhands/index.php"
    ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        return [scrapy.FormRequest(self.login_page,
                                   formdata={'username': 'foo', 'password': 'bar'},
                                   callback=self.parse)]

    def parse(self, response):
        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php?offset=0",
                             callback=self.get_all_tournaments)

    def get_all_tournaments(self, r):
        url = r.xpath("//a/@href[contains(., 'tourneyhistory')]").extract_first()
        yield scrapy.Request(url, callback=self.chosen_tourney)

    def chosen_tourney(self, r):
        url = r.xpath("//a[contains(./text(),'Speedball')]/@href").extract_first()
        query = urlparse(url).query
        yield scrapy.FormRequest("http://webutil.bridgebase.com/v2/tarchive.php?offset=0",
                                 callback=self.get_tourney_data_links,
                                 formdata={k: v[0] for k, v in parse_qs(query).items()})

    def get_tourney_data_links(self, r):
        print(r.xpath("//a/@href").extract())
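The formdata comprehension in chosen_tourney can be seen in isolation: parse_qs maps every query key to a list of values, so the comprehension unwraps the single values into a plain dict suitable for FormRequest. The query string below is made up for illustration; it is written to run on Python 2 or 3:

```python
try:
    from urllib.parse import parse_qs, urlparse  # Python 3
except ImportError:
    from urlparse import parse_qs, urlparse      # Python 2, as in the spider above

# hypothetical tourney URL in the shape the spider extracts
url = "http://webutil.bridgebase.com/v2/tarchive.php?m=h&h=acbl&d=ACBL&o=acbh"
query = urlparse(url).query
formdata = {k: v[0] for k, v in parse_qs(query).items()}
print(formdata)
```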
There are numerous links in the output. For hands you get the tview.php?-t=.... links; you can request each one, joining it to http://webutil.bridgebase.com/v2/, and it will give you a table of all the data that is easy to parse. There are also links to tourney=4796-1455303720-&username=... associated with each hand in the tables. A snippet of the output from the tview link:
class="bbo_tr_t">
<table class="bbo_t_l">
<tr><td class="bbo_tll" align="left">Title</td><td class="bbo_tlv">#4796 Ind. ACBL Fri 2pm</td></tr>
<tr><td class="bbo_tll" align="left">Host</td><td class="bbo_tlv">ACBL</td></tr>
<tr><td class="bbo_tll" align="left">Tables</td><td class="bbo_tlv">9</td></tr>
</table>
</div><div class='sectionbreak'>Section 1 </div><div class='onesection'> <table class='sectiontable' ><tr><th>Name</th><th>Score (IMPs)</th><th class='rank'>Rank</th><th>Prize</th><th>Points</th></tr>
<tr class='odd'><td>colt22</td><td><a href="http://www.bridgebase.com/myhands/hands.php?tourney=4796-1455303720-&username=colt22" target="_blank">42.88</a></td><td class='rank' >1</td><td></td><td>0.90</td></tr>
<tr class='even'><td>francha</td><td><a href="http://www.bridgebase.com/myhands/hands.php?tourney=4796-1455303720-&username=francha" target="_blank">35.52</a></td><td class='rank' >2</td><td></td><td>0.63</td></tr>
<tr class='odd'><td>MSMK</td><td><a href="http://www.bridgebase.com/myhands/hands.php?tourney=4796-1455303720-&username=MSMK" target="_blank">34.38</a></td><td class='rank' >3</td><td></td><td>0.45</td></tr>
The rest of the parsing I will leave to yourself.
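One small aid for that remaining work: joining the relative tview links onto the base URL is safer with urljoin than with string concatenation. A sketch (the query string here is a made-up example in the same shape as the tourney IDs above), written to run on Python 2 or 3:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in the spider above

base = "http://webutil.bridgebase.com/v2/"
# hypothetical relative link in the shape of the tview links
link = "tview.php?t=4796-1455303720-"
print(urljoin(base, link))
# http://webutil.bridgebase.com/v2/tview.php?t=4796-1455303720-
```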