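I want to collect the URLs of some job postings, so I wrote a Scrapy spider. I want every value matched by the XPath //article/dl/dd/h2/a[@class="job-title"]/@href,
but when I run the spider with this command: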
scrapy crawl auseek -a addsthreshold=3
the variable urls, which should hold those values, is empty. Can someone help me figure out why?
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy import log
from scrapy import signals
from myProj.items import ADItem
import time
import urlparse


class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = 0
    addressThresh = 0
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self, **kwargs):
        super(AuSeekSpider, self).__init__()
        # -a arguments arrive as strings; fall back to 0 if addsthreshold is omitted
        self.addressThresh = int(kwargs.get('addsthreshold', 0))
        print 'init finished...'

    def parse_start_url(self, response):
        print 'This is start url function'
        log.msg("Pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:', urls
        if urls:
            # guard the index: urls[0] raises IndexError when nothing matched
            print 'test element:', urls[0].encode("ascii")
        for postfix in urls:
            # extract() already returns the href strings, so no getAttribute() call is needed
            print 'postfix:', postfix
            url = urlparse.urljoin(response.url, postfix)
            yield Request(url, callback=self.parse_ad)

    def parse_ad(self, response):
        print 'this is parse_ad function'
        hxs = Selector(response)
        item = ADItem()
        log.msg("Pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%Y%m%d', time.localtime(time.time()))
        # stop the crawl once more than addressThresh ad pages have been seen
        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('Get enough website address')
        return item
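To make the threshold logic in parse_ad easier to reason about apart from the XPath issue, here is a minimal sketch of the same counting rule in plain Python. It does not use Scrapy: StopCrawl, parse_threshold and AdCounter are hypothetical names made up for this illustration, with StopCrawl standing in for CloseSpider. The sketch assumes the value given with -a reaches the spider as a string.

```python
class StopCrawl(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider (hypothetical, sketch only)."""


def parse_threshold(kwargs, default=0):
    # Values passed with -a arrive as strings; a missing key yields the default.
    return int(kwargs.get("addsthreshold", default))


class AdCounter(object):
    """Mirrors the addressCount / addressThresh bookkeeping of the spider."""

    def __init__(self, **kwargs):
        self.threshold = parse_threshold(kwargs)
        self.count = 0

    def record(self, link):
        # Same rule as parse_ad: count first, then stop once the count exceeds the threshold.
        self.count += 1
        if self.count > self.threshold:
            raise StopCrawl("Get enough website address")
        return link


counter = AdCounter(addsthreshold="3")
collected = []
try:
    for n in range(10):
        collected.append(counter.record("http://www.seek.com.au/job/%d" % n))
except StopCrawl:
    pass

print(len(collected))
```

With addsthreshold=3 the fourth call is the one that raises, so three links are kept. None of this runs, of course, unless parse_start_url yields at least one request.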
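The problem is this line:
urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
urls comes back empty when I print it. I can't figure out why the expression matches nothing or how to fix it. Thanks for your help.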
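One way to narrow this down is to test the same path against markup whose structure is known. The sketch below is only an illustration under assumptions: SAMPLE_HTML is invented and may look nothing like the real seek.com.au page, extract_job_links is a hypothetical helper, and the standard library's xml.etree.ElementTree is used in place of Scrapy's Selector. ElementTree supports only a subset of XPath, so the href is read with .get() instead of /@href, and it needs well-formed markup.

```python
import xml.etree.ElementTree as ET

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Invented markup for this sketch; the real page structure is not known here.
SAMPLE_HTML = """
<html><body>
  <article>
    <dl>
      <dd><h2><a class="job-title" href="/job/111">Python developer</a></h2></dd>
      <dd><h2><a class="other" href="/job/222">Not a job title</a></h2></dd>
    </dl>
  </article>
  <div><h2><a class="job-title" href="/job/333">Outside article</a></h2></div>
</body></html>
"""


def extract_job_links(html, base_url):
    """Return absolute URLs for anchors matching article/dl/dd/h2/a[@class='job-title']."""
    root = ET.fromstring(html)
    anchors = root.findall(".//article/dl/dd/h2/a[@class='job-title']")
    # Each href is a plain string; join it with the page URL to make it absolute.
    return [urljoin(base_url, a.get("href")) for a in anchors]


BASE = "http://www.seek.com.au/jobs/in-australia/"
links = extract_job_links(SAMPLE_HTML, BASE)
print(links)

# A page without that exact element chain yields an empty list, as in the question.
no_match = extract_job_links("<html><body><div><a href='/x'>x</a></div></body></html>", BASE)
print(no_match)
```

If the path works on markup like this but not on the live response, the downloaded HTML probably does not contain that exact chain of elements. Saving response.body to a file, or trying the expression in scrapy shell, would show what Scrapy actually received.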