A genuine Scrapy and Python noob here, so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and first paragraph of the article. I managed to crawl a single page for one item, but the moment I try to expand beyond that it all goes wrong.
My spider:
    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    from basic.items import BasicItem


    class BasicSpiderSpider(CrawlSpider):
        name = "basic_spider"
        allowed_domains = ["news24.com/"]
        start_urls = (
            'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
        )

        rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )

        def parse_items(self, response):
            hxs = Selector(response)
            titles = hxs.xpath('//*[@id="aspnetForm"]')
            items = []
            item = BasicItem()
            item['Headline'] = titles.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["Article"] = titles.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
            items.append(item)
            return items
I am still getting the same problem, though I have noticed that the output file contains only a "[" every time I run the spider. To figure out what the issue is, I ran the following command:
c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328
which gives me the following output:
2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328> (referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 145301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Natal emergency services said on Friday.'],
'Date': [u'2015-03-28 07:30'],
'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests -----------------------------------------------------------------
[]
So I can see that the item is being scraped, but no usable item data ends up in the JSON file. This is how I'm running Scrapy:
scrapy crawl basic_spider -o test.json
I've been looking at the last line (return items), but changing it to either yield or print still gives me no items scraped during the crawl.
A lone "[" in the output file usually means nothing was scraped: no items were extracted during the crawl. In your case, fix your allowed_domains setting. Entries in that list must be bare domain names, and the trailing slash in "news24.com/" stops the OffsiteMiddleware from matching any of the links your crawl rules extract, so every follow-up request is filtered out as offsite. That is why the parse output above shows an empty Requests list even though the item itself is extracted fine.
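The fix is a one-line change, dropping the trailing slash:

    allowed_domains = ["news24.com"]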
Aside from that, just a bit of cleaning up from a perfectionist, sketched below.
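Here is one way the spider could be tidied up on Scrapy 0.24 (a sketch with the same XPaths and link extractor as yours, so the behaviour is unchanged): since 0.24 the response object exposes .xpath() directly, so there is no need to instantiate a Selector by hand, and yielding the item is simpler than collecting it into a one-element list:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    from basic.items import BasicItem


    class BasicSpiderSpider(CrawlSpider):
        name = "basic_spider"
        allowed_domains = ["news24.com"]  # bare domain, no trailing slash
        start_urls = [
            'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
        ]

        # SgmlLinkExtractor() with no arguments matches every link on a page;
        # follow=True keeps the crawl going beyond the start URL
        rules = (
            Rule(SgmlLinkExtractor(), callback="parse_items", follow=True),
        )

        def parse_items(self, response):
            # response.xpath() is a shortcut for Selector(response).xpath()
            item = BasicItem()
            item['Headline'] = response.xpath('//*[@id="article_special"]//h1/text()').extract()
            item['Article'] = response.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item['Date'] = response.xpath('//*[@id="spnDate"]/text()').extract()
            yield item

With the corrected allowed_domains, running scrapy crawl basic_spider -o test.json as before should fill test.json with items as the crawl follows links past the start URL.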