Failed to crawl element of specific website with scrapy spider

Posted 2019-10-20 08:57

I want to collect the URLs of some job postings from a website, so I wrote a Scrapy spider. I want all the values matched by the XPath //article/dl/dd/h2/a[@class="job-title"]/@href, but when I run the spider with the command:

scrapy crawl auseek -a addsthreshold=3

the variable urls that is supposed to hold the values is empty. Can someone help me figure out what is going on?

Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.http import Request      # needed for the Request yielded in parse_start_url
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy import log
from scrapy import signals

from myProj.items import ADItem
import time
import urlparse                      # needed for urlparse.urljoin in parse_start_url

class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self,**kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self,response):
        print 'This is start url function'
        log.msg("Pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:',urls
        print 'test element:',urls[0].encode("ascii")
        for url in urls:
            # the XPath above already extracted the href values, so each url
            # is the relative link itself; just join it with the response URL
            print 'postfix:', url
            url = urlparse.urljoin(response.url, url)
            yield Request(url, callback=self.parse_ad)

        return 


    def parse_ad(self, response):
        print 'this is parse_ad function'
        hxs = Selector(response) 

        item = ADItem()
        log.msg("Pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%Y%m%d',time.localtime(time.time()))

        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('Get enough website address')
        return item

The problematic line is:

urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()

urls is empty when I try to print it. I just can't figure out why it doesn't work or how to fix it. Thanks for any help.

Answer 1:

Scrapy does not evaluate JavaScript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for:

curl http://www.seek.com.au/jobs/in-australia/ | grep job-title
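The same check can be done with Scrapy's own downloader from the shell. This is only a sketch of the expected session with a recent Scrapy, and the empty list is the point, since the anchors are injected by JavaScript:

scrapy shell "http://www.seek.com.au/jobs/in-australia/"
>>> response.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
[]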

You should try PhantomJS or Selenium instead.

After inspecting the network requests in Chrome, the job listings appear to come from this JSONP request. It should be easy to retrieve whatever you need from it.
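If the listings really do come from such an endpoint, a spider can skip the HTML page entirely and request the data directly. Below is a minimal sketch of that idea; the endpoint URL and the field names are assumptions and must be replaced with what you actually see in Chrome's network inspector:

import json
import scrapy

class AuSeekJsonSpider(scrapy.Spider):
    name = "auseek_json"
    # Hypothetical endpoint; copy the real URL and parameters from the
    # request shown in Chrome's network tab. If the response is JSONP
    # (wrapped in a callback), strip the padding before calling json.loads.
    start_urls = ["http://www.seek.com.au/api/jobs?location=in-australia"]

    def parse(self, response):
        data = json.loads(response.body)
        # "jobs" and "url" are assumed field names; adjust to the real JSON layout.
        for job in data.get("jobs", []):
            yield scrapy.Request(job["url"], callback=self.parse_ad)

    def parse_ad(self, response):
        self.log("job page: %s" % response.url)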



Answer 2:

Here is a working example using Selenium with the PhantomJS headless webdriver in a downloader middleware:

from selenium import webdriver
from scrapy.http import HtmlResponse

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        # render the page with PhantomJS so JavaScript-inserted content ends up in the body
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))

I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:

import functools
from scrapy import log

def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing', level=log.DEBUG)
            return method(self, request, spider)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return None

    return wrapper

settings.py:

DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}

For the wrapper to work, every spider must have at minimum:

middleware = set([])

and, to actually include the middleware:

middleware = set([MyProj.middleware.ModuleName.ClassName])
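Putting the pieces together, here is a minimal sketch of a spider that opts in to the middleware and then runs the question's XPath against the now JavaScript-rendered page; the import path for JsDownload is an assumption and has to match wherever you put the middleware class:

import urlparse
import scrapy
from MyProj.middleware.MiddleWareModule import JsDownload  # adjust to your project layout

class AuSeekJsSpider(scrapy.Spider):
    name = "auseek_js"
    start_urls = ["http://www.seek.com.au/jobs/in-australia/"]
    middleware = set([JsDownload])  # opt in to the PhantomJS download middleware

    def parse(self, response):
        # The body is now the PhantomJS-rendered source, so the original XPath matches.
        hrefs = response.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        for href in hrefs:
            yield scrapy.Request(urlparse.urljoin(response.url, href), callback=self.parse_ad)

    def parse_ad(self, response):
        self.log("ad page: %s" % response.url)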

You could have implemented this in a request callback (in the spider), but then the HTTP request would happen twice. This isn't a foolproof solution, but it works for content that loads on .ready(). If you spend some time reading up on Selenium, you can wait for specific events to trigger before saving the page source.
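For example, the process_request method above could be extended to wait until the job anchors are actually present before grabbing the page source. A sketch using Selenium's explicit waits, where the a.job-title selector simply mirrors the XPath from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse

def render_with_wait(url, driver, timeout=10):
    # Load the page, then block until at least one job-title anchor exists
    # (or the timeout expires), so the returned source includes the listings.
    driver.get(url)
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "a.job-title"))
    )
    return HtmlResponse(url, encoding='utf-8',
                        body=driver.page_source.encode('utf-8'))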

Another example: https://github.com/scrapinghub/scrapyjs

More info: What's the best way of scraping data from a website?

Cheers!



Source: Failed to crawl element of specific website with scrapy spider