I want to get the website addresses of some jobs, so I wrote a Scrapy spider. I want to extract all of the values matched by the XPath //article/dl/dd/h2/a[@class="job-title"]/@href,
but when I execute the spider with the command:
scrapy crawl auseek -a addsthreshold=3
the variable "urls" used to preserve values is empty, can someone help me to figure it,
here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy import log
from scrapy import signals
from myProj.items import ADItem
import urlparse
import time


class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self, **kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self, response):
        print 'This is start url function'
        log.msg("Pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:', urls
        print 'test element:', urls[0].encode("ascii")
        for url in urls:
            # extract() on an @href XPath already returns the href strings
            postfix = url
            print 'postfix:', postfix
            url = urlparse.urljoin(response.url, postfix)
            yield Request(url, callback=self.parse_ad)
        return

    def parse_ad(self, response):
        print 'this is parse_ad function'
        hxs = Selector(response)
        item = ADItem()
        log.msg("Pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%Y%m%d', time.localtime(time.time()))
        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('Get enough website address')
        return item
The problem is:
urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
urls is empty when I try to print it out. I just can't figure out why it doesn't work or how to correct it. Thanks for your help.
Scrapy does not evaluate JavaScript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for.
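For instance, with Scrapy's own shell (the XPath is the one from the question):

scrapy shell "http://www.seek.com.au/jobs/in-australia/"
>>> sel.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
[]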
You should try PhantomJS or Selenium instead.
After examining the network requests in Chrome, it looks like the job listings come from this JSONP request. It should be easy to retrieve whatever you need from it.
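As a rough sketch of that approach (the URL and callback name below are placeholders, not the actual endpoint; substitute the JSONP request you see in Chrome's Network tab):

import json
import re
import urllib2

# placeholder URL - replace with the JSONP request from the Network tab
url = 'http://www.seek.com.au/some-jsonp-endpoint?callback=jsonp1'
raw = urllib2.urlopen(url).read()

# JSONP wraps the JSON payload in a callback, e.g. jsonp1({...});
# strip the wrapper before parsing
payload = re.search(r'\((.*)\)\s*;?\s*$', raw, re.S).group(1)
data = json.loads(payload)
print data.keys()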
Here is a working example using Selenium and the PhantomJS headless webdriver in a download handler middleware.
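A minimal sketch of such a middleware (the module path myProj/middleware.py, the class name SeleniumSpiderMiddleware and the details below are illustrative; it assumes the selenium package is installed and the phantomjs binary is on your PATH):

# myProj/middleware.py
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumSpiderMiddleware(object):
    """Downloader middleware that fetches pages with PhantomJS so that
    JavaScript-rendered content ends up in the response body."""

    def __init__(self):
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # page_source holds the DOM after the page's scripts have run
        body = self.driver.page_source.encode('utf-8')
        return HtmlResponse(self.driver.current_url, body=body,
                            encoding='utf-8', request=request)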
I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:
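One plausible shape for that wrapper is a decorator that checks an opt-in set on the spider (the decorator name spider_opted_in and the middleware attribute are illustrative; you would apply it to process_request in the sketch above):

import functools

def spider_opted_in(method):
    """Run the decorated middleware method only for spiders that list this
    middleware class in their `middleware` set; otherwise return None so
    the request falls through to the normal downloader."""
    @functools.wraps(method)
    def wrapper(self, request, spider):
        if self.__class__ in getattr(spider, 'middleware', set()):
            return method(self, request, spider)
        return None
    return wrapper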
settings.py:
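Something along these lines (the dotted path matches the sketch above; the priority value is arbitrary):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myProj.middleware.SeleniumSpiderMiddleware': 543,
}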
For the wrapper to work, all spiders must have at minimum:
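Presumably an empty opt-in set as a class attribute, matching the wrapper sketch above:

class MySpider(CrawlSpider):
    # no JavaScript-aware middleware requested for this spider
    middleware = set([])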
To include a middleware:
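Again following the sketch above, the spider would list the middleware class it wants:

from myProj.middleware import SeleniumSpiderMiddleware

class MySpider(CrawlSpider):
    # route this spider's requests through the PhantomJS middleware
    middleware = set([SeleniumSpiderMiddleware])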
You could have implemented this in a request callback (in the spider), but then the HTTP request would happen twice. This isn't a foolproof solution, but it works for content that loads on .ready(). If you spend some time reading into Selenium, you can wait for specific events to trigger before saving the page source.
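For example, with Selenium's explicit waits (a sketch; driver is the PhantomJS driver from the middleware above, and the selector matches the anchors from the question):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the job anchors are present,
# then it is safe to read driver.page_source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a.job-title')))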
Another example: https://github.com/scrapinghub/scrapyjs
More info: What's the best way of scraping data from a website?
Cheers!