I have two spiders that take URLs and data scraped by a main spider. My approach was to use CrawlerProcess in the main spider and pass the data to the two other spiders. Here's my approach:
import scrapy
from scrapy.crawler import CrawlerProcess


class LightnovelSpider(scrapy.Spider):
    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None, *args, **kwargs):
        super(LightnovelSpider, self).__init__(*args, **kwargs)
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            yield scrapy.Request(novel, callback=self.parseNovel)

    def parseNovel(self, response):
        # stuff here
        pass


class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here


class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a[not(@href="#")]/@href').extract():
            initCrawler.fromScraper.append(novel)
        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        # start a second crawl from inside the already-running spider
        process = CrawlerProcess()
        process.crawl(LightnovelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")
I run "scrapy crawl main" and get a beautiful error.
The main error I can see is twisted.internet.error.ReactorAlreadyRunning, which I have no idea about. Are there better approaches to running multiple spiders from another spider, and/or how can I stop this error?
Wow, I didn't know something like this could work, but I've never tried it.
When multiple scraping stages have to work hand in hand, I do one of these two things instead:
Option 1 - Use a database
When the scrapers have to run in a continuous mode, rescanning sites etc., I just make the spiders push their results into a database (through an item pipeline).
The spiders that do the subsequent processing then pull the data they need from the same database (in your case the novel URLs, for example).
Then keep everything running using a scheduler or cron, and the spiders will work hand in hand.
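A rough sketch of what that could look like, assuming MongoDB via pymongo; the database, collection and field names are purely illustrative, and the pipeline would still have to be enabled through ITEM_PIPELINES in your settings:

import pymongo
import scrapy


class MongoWriterPipeline(object):
    """Item pipeline that pushes every scraped item into MongoDB."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        # illustrative database/collection names
        self.collection = self.client["lightnovel"]["novels"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert by URL so rescans update existing records
        self.collection.update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True)
        return item


class NovelDetailFromDbSpider(scrapy.Spider):
    """Second-stage spider that pulls its start URLs from the same database."""
    name = "novelDetailFromDb"

    def start_requests(self):
        client = pymongo.MongoClient("mongodb://localhost:27017")
        for doc in client["lightnovel"]["novels"].find({}, {"url": 1}):
            yield scrapy.Request(doc["url"], callback=self.parse_novel)

    def parse_novel(self, response):
        # detail parsing goes here
        pass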
Option 2 - Merging everything into one spider
That's the way I choose when everything needs to run as a single script: I create one spider that chains multiple request steps together.
(code is not tested, it's just to show the basic idea)
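A minimal sketch of that chained-spider idea; the selectors for the novel list come from your code, while the chapter-link selector, callback names and item fields are hypothetical:

import scrapy


class LightnovelChainSpider(scrapy.Spider):
    """One spider that walks novel list -> novel detail -> chapters in a chain."""
    name = "lightnovel_chain"
    allowed_domains = ["readlightnovel.com"]
    start_urls = ["http://www.readlightnovel.com/novel-list"]

    def parse(self, response):
        # step 1: collect novel URLs from the list page
        for href in response.xpath('//div[@class="list-by-word-body"]/ul/li/a[not(@href="#")]/@href').extract():
            yield scrapy.Request(href, callback=self.parse_novel)

    def parse_novel(self, response):
        # step 2: scrape novel details, then follow the chapter links
        # (the chapter selector below is made up for illustration)
        item = {"novel_url": response.url}
        for href in response.xpath('//ul[@class="chapter-chs"]/li/a/@href').extract():
            yield scrapy.Request(href, callback=self.parse_chapter,
                                 meta={"item": item})

    def parse_chapter(self, response):
        # step 3: attach chapter data to the novel item and yield it
        item = dict(response.meta["item"])
        item["chapter_url"] = response.url
        yield item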
If things get too complex I pull out some elements and move them into mixin classes.
In your case I would most probably prefer option 2.
After some research I was able to solve this problem by using a property decorator (@property) to retrieve data from the main spider, like this:
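The original snippet isn't shown here; a sketch of what this could look like, with the internal attribute name being an assumption, is a property on the main spider that always returns the latest scraped URLs:

class initCrawler(scrapy.Spider):
    name = "main"

    def __init__(self, *args, **kwargs):
        super(initCrawler, self).__init__(*args, **kwargs)
        self._toNovel = []

    @property
    def toNovel(self):
        # anything holding a reference to this spider can always read
        # the most recent list of scraped novel URLs
        return self._toNovel

    def parse(self, response):
        for href in response.xpath('//div[@class="list-by-word-body"]/ul/li/a[not(@href="#")]/@href').extract():
            self._toNovel.append(href)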
Then I used CrawlerRunner, like this:
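Again, the exact code isn't shown; a minimal sketch under the assumption that the goal is simply to start the second crawl while the reactor from "scrapy crawl main" is already running. CrawlerRunner.crawl() returns a Deferred and does not try to start the reactor itself, which is what avoids ReactorAlreadyRunning:

from scrapy.crawler import CrawlerRunner


class initCrawler(scrapy.Spider):
    name = "main"
    # ... start_requests / parse / checkchanged as before ...

    def dispatchSpiders(self):
        # CrawlerRunner does not start or stop the reactor, so it can be used
        # while the reactor started by "scrapy crawl main" is still running
        runner = CrawlerRunner()
        d = runner.crawl(LightnovelSpider, novels=self.toNovel)
        d.addCallback(lambda _: self.logger.info("novelDetail spider finished"))
        return d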