Confused about running Scrapy from within a Python script

Posted 2019-01-24 14:09

Question:

Following the documentation, I can run Scrapy from a Python script, but I can't get the scraped result back.

This is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from items import DmozItem

class DmozSpider(BaseSpider):
    name = "douban" 
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/group/xxx/discussion"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select("//table[@class='olt']/tr/td[@class='title']/a")
        items = []
        # print sites
        for row in rows:
            item = DmozItem()
            item["title"] = row.select('text()').extract()[0]
            item["link"] = row.select('@href').extract()[0]
            items.append(item)

        return items

Note the last line, where I return the parsed items. If I run:

 scrapy crawl douban

the terminal prints the returned items.

But I can't get the returned items from the Python script. Here is my Python script:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from spiders.dmoz_spider import DmozSpider
from scrapy.xlib.pydispatch import dispatcher

def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg("------------>Running reactor")
result = reactor.run()
print result
log.msg("------------>Running stopped")

I tried to capture the result from reactor.run(), but it returns nothing.

How can I get the result?

Answer 1:

The terminal prints the result because, when you run scrapy crawl, the default log level is DEBUG.

When you run your spider from a script and call log.start(), the default log level is INFO, so the DEBUG messages that show the scraped items are not printed.

Just replace:

log.start()

with

log.start(loglevel=log.DEBUG)
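
To illustrate why the level matters, here is a minimal sketch using Python's standard logging module rather than the old scrapy.log API shown above (later Scrapy releases moved to standard logging). The logger names and messages are made up for the demonstration. The point is only that a DEBUG-level message, which is the level Scrapy uses when it logs scraped items, is dropped when the threshold is INFO.

```python
import io
import logging


def capture(level):
    """Log one DEBUG and one INFO message; return the lines that got through."""
    stream = io.StringIO()
    # Hypothetical logger name, one per level so the runs do not interfere.
    logger = logging.getLogger("demo.level%d" % level)
    logger.setLevel(level)
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        logger.debug("Scraped item: {'title': 'example'}")
        logger.info("Spider closed")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().splitlines()


info_lines = capture(logging.INFO)    # the DEBUG message is filtered out
debug_lines = capture(logging.DEBUG)  # both messages are kept

print(info_lines)
print(debug_lines)
```

With the INFO threshold only the "Spider closed" line survives, which matches what happens to the item output in the script from the question.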

Update:

To get the result as a string, you can log everything to a file and then read it back, e.g.:

log.start(logfile="results.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)

reactor.run()

with open("results.log", "r") as f:
    result = f.read()
print result

Hope that helps.



Answer 2:

I found your question while asking myself the same thing: how can I get the result? Since that wasn't answered here, I worked out the answer myself and can now share it. Connect a handler to the item_passed signal and collect the items as they come through:

items = []
def add_item(item):
    items.append(item)
dispatcher.connect(add_item, signal=signals.item_passed)

Or, for Scrapy 0.22 (http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script), replace the last line of my solution with:

crawler.signals.connect(add_item, signals.item_passed)

My solution is freely adapted from http://www.tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/.
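
For readers on a newer Scrapy, the same signal-based idea can be sketched with CrawlerProcess. This is a sketch under assumptions, not the answer's original code: it assumes a Scrapy version where the signal is named item_scraped (item_passed was the older name) and where CrawlerProcess.create_crawler is available. ItemCollector and crawl_and_collect are hypothetical helper names. Scrapy is imported inside the function only so that the collector can be tried without Scrapy installed.

```python
class ItemCollector:
    """Accumulates items delivered by a Scrapy item signal."""

    def __init__(self):
        self.items = []

    def add_item(self, item, **kwargs):
        # Scrapy also passes response= and spider=; they are ignored here.
        self.items.append(item)


def crawl_and_collect(spider_cls):
    """Run one spider to completion and return the items it produced.

    Assumes a Scrapy version that provides CrawlerProcess and the
    item_scraped signal.
    """
    # Imported lazily so ItemCollector can be used without Scrapy installed.
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    collector = ItemCollector()
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(collector.add_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl has finished
    return collector.items


# Hypothetical usage, with DmozSpider being the spider from the question:
# items = crawl_and_collect(DmozSpider)
# print(items)
```

Because the Twisted reactor cannot be restarted, crawl_and_collect as written can only be called once per Python process.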



Answer 3:

In my case, I placed the script file at the Scrapy project level. For example, if the spiders live in scrapyproject/scrapyproject/spiders, I placed the script at scrapyproject/myscript.py.
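
One plausible reason this placement works (an interpretation, not something the answer states) is that Python puts the directory of the script being run at the front of sys.path, so a script at the project root can import the inner scrapyproject package. The sketch below builds a throwaway layout in a temporary directory and runs a stand-in myscript.py. The dmoz_spider.py file here only holds a placeholder constant, not a real spider.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script_from_project_root():
    """Create a fake project layout, run myscript.py, and return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "scrapyproject"            # project level
        package = root / "scrapyproject"              # inner Python package
        spiders = package / "spiders"
        spiders.mkdir(parents=True)
        (package / "__init__.py").write_text("")
        (spiders / "__init__.py").write_text("")
        # Placeholder module standing in for the real spider file.
        (spiders / "dmoz_spider.py").write_text("SPIDER_NAME = 'douban'\n")

        script = root / "myscript.py"
        script.write_text(
            "from scrapyproject.spiders.dmoz_spider import SPIDER_NAME\n"
            "print(SPIDER_NAME)\n"
        )
        # The script's own directory (root) becomes sys.path[0] in the child.
        completed = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            check=True,
        )
        return completed.stdout.strip()


output = run_script_from_project_root()
print(output)
```

The import succeeds only because myscript.py sits next to the inner package. Moving it elsewhere would require adjusting sys.path or installing the project.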