Calling scrapy from a python script not creating JSON output file

Posted 2019-07-21 21:50

Below is the Python script in which I call Scrapy, using the answer from

Scrapy crawl from script always blocks script execution after scraping

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# import MySpider from your project here

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')

Here is my pipelines.py code:

from scrapy import log,signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):

    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item

This code is taken from here:

Scrapy :: Issues with JSON export

When I run the crawler like this:

scrapy crawl MySpider -a start_url='abc'

a links file is created with the expected output. But when I run the Python script, no file is created, even though the crawler runs and dumps Scrapy stats similar to the previous run. I think there is a mistake in the Python script, since the file does get created with the first approach. How do I get the script to output the file?
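For the pipeline to run at all, it has to be listed in the settings the crawler actually loads; in the project's settings.py that entry would look roughly like the sketch below (the scrapermar11 module name is only an assumption based on the pipeline class name, and newer Scrapy versions expect a dict mapping the dotted path to a priority instead of a list):

# settings.py -- hypothetical entry; the module name is assumed from the pipeline class.
# Older Scrapy versions take a list of dotted paths, newer ones a dict such as
# {'scrapermar11.pipelines.scrapermar11Pipeline': 300}.
ITEM_PIPELINES = ['scrapermar11.pipelines.scrapermar11Pipeline']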

Answer 1:

This code works for me:

from twisted.internet import reactor
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.http import Request
from multiprocessing.queues import Queue
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process
# import your spider here

def handleSpiderIdle(spider):
    reactor.stop()

mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}

settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="") # create a spider ourselves
crawlerProcess.crawl(spider) # add it to spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)

log.start() # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."


Answer 2:

The solution that worked for me was to ditch the script and the internal API, and to parallelize using the command line and GNU Parallel instead.

To run all known spiders, one per core:

scrapy list | parallel --line-buffer scrapy crawl

scrapy list prints all the spiders, one per line, allowing us to pipe them as arguments to append to the command (scrapy crawl) passed to GNU Parallel. --line-buffer means that output received back from the processes is printed to stdout intermixed, but on a line-by-line basis rather than with quarter/half lines garbled together (for other options look at --group and --ungroup).

NB: obviously this works best on machines with multiple CPU cores, since by default GNU Parallel runs one job per core. Note that unlike many modern dev machines, the cheap AWS EC2 and DigitalOcean tiers only have one virtual CPU core, so if you want to run jobs concurrently on a single core you will have to play with the --jobs argument to GNU Parallel. For example, to run 2 scrapy crawlers per core:

scrapy list | parallel --jobs 200% --line-buffer scrapy crawl
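
If each spider should also write its items to its own JSON file, the same pattern can be combined with Scrapy's feed exports; a sketch, assuming the -o feed-export option and GNU Parallel's {} replacement string:

scrapy list | parallel --line-buffer scrapy crawl {} -o {}.json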


Source: Calling scrapy from a python script not creating JSON output file