Scrapy cannot scrape a second page using ItemLoader

UPDATE (7/29, 9:29pm): After reading this post, I updated my code.

UPDATE (7/28/15, 7:35pm): After following Martin's suggestion, the message changed, but still no items are listed and nothing is written to the database.

ORIGINAL: I can successfully scrape a single page (the base page). I then tried to scrape one more field for each item from a second URL found on the base page, using Request with a callback, but it does not work. The spider is here:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem 
from CAPjobs.items import CAPjobsItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join

class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
        "http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"
    ]

    def parse_subpage(self, response):
        il = response.meta['il']
        il.add_xpath('loc_pj', '//div[@id="extranav"]/div/dl/dd[2]/ul/li/text()')  
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')    

        for site in sites:

            il = CAPjobsItemLoader(CAPjobsItem(), selector = site) 
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)

The scrape now partially works, but the loc_pj field is missing from the output (UPDATE on 7/29, 7:35pm):

2015-07-29 21:28:24 [scrapy] DEBUG: Scraped from <200 http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000>
{'post_date': u'21 days ago',
'title': u'Assistant, Associate, Full (HS Clinical, Clin X) - Anatomic Pathology/Cytopathology (11-000)',
'web_url': u'http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000'}

Tags: python-2.7 scrapy scrapy-spider

1 Answer

倾城　Initia

#2 · 2019-06-01 02:57

You initialize the ItemLoader like so:

il = CAPjobsItemLoader(CAPjobsItem, sites)

In the documentation it is done like so:

l = ItemLoader(item=Product(), response=response)

So I think you're missing the parentheses after CAPjobsItem: the loader needs an item instance, not the item class. Your line should read:

il = CAPjobsItemLoader(CAPjobsItem(), sites)
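
Separately from the parentheses, the log in the question shows title, post_date and web_url but no loc_pj. One likely cause is that a loader keeps the selector it was constructed with, so a loader built from the listing page and passed through meta still runs its XPath against the listing page inside the callback. The sketch below illustrates that idea with the standard library only. ToyLoader, the two sample pages and the function names are all hypothetical stand-ins made up for illustration; this is not Scrapy's implementation, and the paths use ElementTree's limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pages standing in for the listing page and a job page.
LISTING = ET.fromstring(
    '<html><div class="job-details"><h3><a href="/jobs/1">Pathologist</a></h3></div></html>'
)
DETAIL = ET.fromstring(
    '<html><div id="extranav"><ul><li>Boston, MA</li></ul></div></html>'
)


class ToyLoader:
    """Toy stand-in for an item loader (assumption: NOT Scrapy's real class)."""

    def __init__(self, item=None, selector=None):
        # Copy any values collected so far; each field maps to a list of values.
        self.item = {k: list(v) for k, v in (item or {}).items()}
        # The node to query is fixed when the loader is created.
        self.selector = selector

    def add_xpath(self, field, path):
        # Always queries the node given at construction time.
        values = [el.text for el in self.selector.findall(path) if el.text]
        if values:
            self.item.setdefault(field, []).extend(values)

    def load_item(self):
        return {k: list(v) for k, v in self.item.items()}


def parse_listing(listing_page):
    """Build a loader bound to the job block on the listing page."""
    job = listing_page.find(".//div[@class='job-details']")
    loader = ToyLoader(selector=job)
    loader.add_xpath("title", "h3/a")
    return loader


def parse_detail_reusing_loader(loader, detail_page):
    """Mirrors the question's callback: detail_page is never consulted."""
    loader.add_xpath("loc_pj", ".//div[@id='extranav']/ul/li")
    return loader.load_item()


def parse_detail_with_new_loader(loader, detail_page):
    """Carry the collected values over, but bind a new loader to the new page."""
    fresh = ToyLoader(item=loader.load_item(), selector=detail_page)
    fresh.add_xpath("loc_pj", ".//div[@id='extranav']/ul/li")
    return fresh.load_item()


broken = parse_detail_reusing_loader(parse_listing(LISTING), DETAIL)
fixed = parse_detail_with_new_loader(parse_listing(LISTING), DETAIL)
print(broken)
print(fixed)
```

If the same thing is happening in the spider, the equivalent change would be to build a new loader inside parse_subpage from the item collected so far and the new response, for example CAPjobsItemLoader(item=il.load_item(), response=response), and call add_xpath on that one. I have not run this against the site, so treat it as something to try rather than a confirmed fix.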
