Disclaimer: the site I am crawling is a corporate intranet, and I have modified the URL slightly for corporate privacy.
I managed to log into the site, but I have failed to crawl it.
Starting from the start_url https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf (which redirects to a similar page with a more complex URL, e.g. https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument {unid=ADE682E34FC59D274825770B0037D278}), I want to crawl, for every page including the start_url, every href found under //li/a. (Every page crawled exposes an abundant number of hyperlinks, and some of them duplicate because both the parent and the child pages are reachable from the same page.)
As you may see, the href does not contain the actual link (the one quoted above) that we see after navigating to that page, and there is also a # in front of its useful content. Could that be the source of the problem?
For restrict_xpaths, I have restricted the path to the 'logout' link on the page.
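(Roughly, the rule below is what I am aiming for — just a sketch: the allow pattern is my own guess and the callback name simply matches my spider further down.)

from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# Sketch of the rule I think I need: follow every href found under //li/a.
# The allow pattern is an assumption on my part, not verified against the
# real intranet URLs.
Rule(LinkExtractor(allow=r'LotusQuickr/dept/',
                   restrict_xpaths='//li/a',
                   unique=True),
     callback='parse_item', follow=True)

My actual spider: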
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
import scrapy

class kmssSpider(CrawlSpider):
    name = 'kmss'
    start_url = ('https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',)
    login_page = 'https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    allowed_domain = ["kmssqkr.sarg"]

    rules = (
        Rule(LinkExtractor(allow=(r'https://kmssqkr.sarg/LotusQuickr/dept/\w*'),
                           restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'),
                           unique=True),
             callback='parse_item', follow=True),
    )
    # r"LotusQuickr/dept/^[ A-Za-z0-9_@./#&+-]*$"
    # restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'), unique=True)

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(response,
                                         formdata={'user': 'user', 'password': 'pw'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
            yield Request(url=self.start_url[0])
        else:
            self.log("\n\n You are not logged in \n\n ")

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
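(To check which links the extractor would actually pick up on a crawled page, I believe something like the following works in scrapy shell — a sketch only; the allow pattern is again my own guess.)

# Run inside `scrapy shell <page url>` once the page is reachable.
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'LotusQuickr/dept/', restrict_xpaths='//li/a')
for link in le.extract_links(response):
    # Relative hrefs are joined against the page URL; as far as I understand,
    # the leading # fragment is dropped when the URL is canonicalized.
    print(link.url)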
Log:
2015-07-27 16:46:18 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-27 16:46:18 [boto] DEBUG: Retrieving credentials from metadata server.
2015-07-27 16:46:19 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
'_open', req)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2015-07-27 16:46:19 [boto] ERROR: Unable to read instance data, giving up
2015-07-27 16:46:19 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-27 16:46:19 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-27 16:46:19 [scrapy] INFO: Enabled item pipelines:
2015-07-27 16:46:19 [scrapy] INFO: Spider opened
2015-07-27 16:46:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-27 16:46:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-27 16:46:24 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None)
2015-07-27 16:46:28 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr.ccgo.sarg/names.nsf?Login> (referer: https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login)
2015-07-27 16:46:29 [kmss] DEBUG:
Successfuly Logged in
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf>
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument>
2015-07-27 16:46:29 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> (referer: https://kmssqkr.sarg/names.nsf?Login)
2015-07-27 16:46:29 [scrapy] INFO: Closing spider (finished)
2015-07-27 16:46:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1954,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 31259,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 27, 8, 46, 29, 286000),
'log_count/DEBUG': 8,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2015, 7, 27, 8, 46, 19, 528000)}
2015-07-27 16:46:29 [scrapy] INFO: Spider closed (finished)
(Screenshot: http://i.stack.imgur.com/REQXJ.png)
----------------------------------UPDATED---------------------------------------
I saw the cookie format in http://doc.scrapy.org/en/latest/topics/request-response.html. These are my cookies on the site, but I am not sure which of them I should add, or how to pass them along with the Request.
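If I have to set them by hand, I suppose the cookies argument of Request is the way to do it — a sketch only; the cookie names and values below are placeholders for whatever the browser shows, not the real ones:

# Hedged sketch, meant to live inside one of the spider's callbacks (where
# Request is already imported). Replace the placeholder names/values with the
# actual cookies from the browser. If I read the log correctly, CookiesMiddleware
# is enabled, so the session cookies from the login response should normally be
# carried over for me automatically.
yield Request(
    url=self.start_url[0],
    cookies={
        'SessionCookieName': 'value-copied-from-the-browser',   # placeholder
        'AnotherCookieName': 'value-copied-from-the-browser',   # placeholder
    },
    dont_filter=True,
)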