Scrapy: start crawling after login

Posted 2019-03-04 19:46

Question:

Disclaimer: the site I am crawling is a corporate intranet, and I have modified the URLs a bit for privacy.

I managed to log in to the site, but I have failed to crawl it.

The crawl starts from the start_url https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf (this page redirects you to a similar page with a more complex URL, i.e.

https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument {unid=ADE682E34FC59D274825770B0037D278})

For every page, including the start_url, I want to crawl every href found under //li/a. Each crawled page contains an abundant number of hyperlinks, and some of them are duplicates because you can reach both the parent and child pages from the same page.

As you may see, the href does not match the actual link (the one quoted above) that we see when we crawl into that page. There is also a # in front of its useful content. Could that be the source of the problem?
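
In case the leading # turns out to matter, here is a minimal, untested sketch of how the extracted hrefs could be normalized with LinkExtractor's process_value hook before they are joined with the page URL (strip_leading_hash is a hypothetical helper, and whether any of this is needed depends on what the hrefs actually contain):

from scrapy.linkextractors import LinkExtractor

# Hypothetical: normalize each raw href before the extractor resolves it.
# Returning None tells the extractor to discard the value.
def strip_leading_hash(value):
    cleaned = value.lstrip('#').strip()
    return cleaned or None

link_extractor = LinkExtractor(restrict_xpaths='//li/a',
                               process_value=strip_leading_hash)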

For restrict_xpaths, I have restricted the extraction to the 'logout' link on the page.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
import scrapy


class kmssSpider(CrawlSpider):
    name = 'kmss'
    start_url = ('https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',)
    login_page = 'https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    allowed_domain = ["kmssqkr.sarg"]

    rules = (
        Rule(LinkExtractor(allow=(r'https://kmssqkr.sarg/LotusQuickr/dept/\w*'),
                           restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'),
                           unique=True),
             callback='parse_item', follow=True),
    )
    # r"LotusQuickr/dept/^[ A-Za-z0-9_@./#&+-]*$"
    # restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'), unique=True)

    def start_requests(self):
        # Request the login page first instead of the start_url
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        # Submit the login form found on the login page
        return FormRequest.from_response(response,
                                         formdata={'user': 'user', 'password': 'pw'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        # Only start the crawl once the login is confirmed
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
            yield Request(url=self.start_url[0])
        else:
            self.log("\n\n You are not logged in \n\n ")

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

Log:

2015-07-27 16:46:18 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-27 16:46:18 [boto] DEBUG: Retrieving credentials from metadata server.
2015-07-27 16:46:19 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2015-07-27 16:46:19 [boto] ERROR: Unable to read instance data, giving up
2015-07-27 16:46:19 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-27 16:46:19 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-27 16:46:19 [scrapy] INFO: Enabled item pipelines: 
2015-07-27 16:46:19 [scrapy] INFO: Spider opened
2015-07-27 16:46:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-27 16:46:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-27 16:46:24 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None)
2015-07-27 16:46:28 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr.ccgo.sarg/names.nsf?Login> (referer: https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login)
2015-07-27 16:46:29 [kmss] DEBUG: 



 Successfuly Logged in 



2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf>
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument>
2015-07-27 16:46:29 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> (referer: https://kmssqkr.sarg/names.nsf?Login)
2015-07-27 16:46:29 [scrapy] INFO: Closing spider (finished)
2015-07-27 16:46:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1954,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 31259,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 27, 8, 46, 29, 286000),
 'log_count/DEBUG': 8,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2015, 7, 27, 8, 46, 19, 528000)}
2015-07-27 16:46:29 [scrapy] INFO: Spider closed (finished)

Screenshot: http://i.stack.imgur.com/REQXJ.png

----------------------------------UPDATED---------------------------------------

I saw the cookie format described in http://doc.scrapy.org/en/latest/topics/request-response.html. These are my cookies on the site, but I am not sure which ones I should add, and how to add them along with the Request.
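
For reference, those docs accept cookies either as a plain dict or as a list of dicts; a short sketch of both forms (the cookie name and value below are placeholders, not my real ones):

from scrapy.http import Request

# Form 1: plain dict of name -> value (placeholder name/value)
req = Request(
    url='https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',
    cookies={'SessionID': 'value-copied-from-the-browser'},
)

# Form 2: list of dicts, if domain and path need to be set explicitly
req = Request(
    url='https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',
    cookies=[{'name': 'SessionID',
              'value': 'value-copied-from-the-browser',
              'domain': 'kmssqkr.sarg',
              'path': '/'}],
)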

Answer 1:

First of all, do not be demanding; sometimes I get angry and won't answer your question.

To see which cookies are sent with your Request, enable cookie debugging with COOKIES_DEBUG = True.
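
For example, in the project's settings.py (COOKIES_DEBUG is off by default, while cookie handling itself is enabled by default):

# settings.py
COOKIES_ENABLED = True   # already the default; shown for clarity
COOKIES_DEBUG = True     # log cookies sent with requests and received in responses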

Then you will notice that the cookies are not sent, even though Scrapy's cookie middleware should send them. I think this is because you yield a custom request, and Scrapy will not try to be cleverer than you: it accepts your decision to send this request without cookies.

This means you need to read the cookies from the response and add the required ones (or all of them) to your Request.
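
A minimal sketch of that idea, assuming the login response sets the session via Set-Cookie headers (the parsing below is simplified; the real cookie names are whatever the server sends, which the COOKIES_DEBUG output will show):

from scrapy.http import Request

def check_login_response(self, response):
    if 'Welcome' in response.body:
        self.log("Successfully logged in")
        # Collect NAME=VALUE pairs from the login response's Set-Cookie headers
        cookies = {}
        for header in response.headers.getlist('Set-Cookie'):
            name, _, value = header.split(';', 1)[0].partition('=')
            cookies[name.strip()] = value.strip()
        # Attach them explicitly to the follow-up request
        yield Request(url=self.start_url[0], cookies=cookies)
    else:
        self.log("You are not logged in")

With that in place, the COOKIES_DEBUG output should show the cookies being sent on the follow-up GET.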