I'm new to Scrapy and working on a scraping exercise, using CrawlSpider. Although the Scrapy framework works beautifully and follows the relevant links, I can't seem to get the CrawlSpider to scrape the very first link (the home/landing page). Instead, it goes directly to the links matched by the rules, but never scrapes the landing page itself. I don't know how to fix this, since overriding CrawlSpider's parse method is not recommended. Changing follow=True/False doesn't yield any good results either. Here is a snippet of the code:
class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_item', follow=True),
    )
    fname = 1

    def parse_item(self, response):
        open(str(self.fname) + '.txt', 'a').write(response.url)
        open(str(self.fname) + '.txt', 'a').write(',' + str(response.meta['depth']))
        open(str(self.fname) + '.txt', 'a').write('\n')
        open(str(self.fname) + '.txt', 'a').write(response.body)
        open(str(self.fname) + '.txt', 'a').write('\n')
        self.fname = self.fname + 1
Just change your callback to parse_start_url and override it:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de",
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
    )
    fname = 0

    def parse_start_url(self, response):
        self.fname += 1
        fname = '%s.txt' % self.fname
        with open(fname, 'w') as f:
            f.write('%s, %s\n' % (response.url, response.meta.get('depth', 0)))
            f.write('%s\n' % response.body)
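This works because CrawlSpider reserves its parse method for rule processing and routes every start-URL response through parse_start_url, which returns nothing by default. A stripped-down, Scrapy-free sketch of that dispatch (the class names here are illustrative, not Scrapy's actual implementation):

```python
class MiniCrawlSpider:
    """Toy model of CrawlSpider's dispatch for start-URL responses."""

    def parse(self, response):
        # CrawlSpider.parse hands the start-URL response to parse_start_url...
        for item in self.parse_start_url(response):
            yield item
        # ...and would then extract links per the rules and schedule callbacks.

    def parse_start_url(self, response):
        # Default is a no-op, which is why the landing page is never scraped.
        return []


class LandingPageSpider(MiniCrawlSpider):
    def parse_start_url(self, response):
        # Overriding the hook lets the landing page produce output too.
        return [response]


spider = LandingPageSpider()
print(list(spider.parse("http://www.bnt-chemicals.de")))
```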
There are a number of ways to do this, but one of the simplest is to implement parse_start_url and then modify start_urls:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = ["http://www.bnt-chemicals.de/tunnel/index.htm"]
    rules = (
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_item', follow=True),
    )
    fname = 1

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        open(str(self.fname) + '.txt', 'a').write(response.url)
        # use .get() with a default: the start-URL response has no 'depth' in meta
        open(str(self.fname) + '.txt', 'a').write(',' + str(response.meta.get('depth', 0)))
        open(str(self.fname) + '.txt', 'a').write('\n')
        open(str(self.fname) + '.txt', 'a').write(response.body)
        open(str(self.fname) + '.txt', 'a').write('\n')
        self.fname = self.fname + 1
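A side note on parse_item above: each open(...).write(...) reopens the file and relies on the interpreter to close it. Collecting the writes under a single with block produces the same file contents and guarantees the handle is closed. A minimal stdlib-only sketch (the helper name and path are illustrative):

```python
import os
import tempfile


def write_item(fname, url, depth, body):
    # One open/close, same output as the five chained open().write() calls.
    with open(fname, 'a') as f:
        f.write(url)
        f.write(',' + str(depth))
        f.write('\n')
        f.write(body)
        f.write('\n')


path = os.path.join(tempfile.mkdtemp(), '1.txt')
write_item(path, 'http://www.bnt-chemicals.de', 0, '<html></html>')
```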