I want to crawl a website with two parts, and my script is not as fast as I need it to be.
Is it possible to launch two spiders, one for scraping the first part and a second one for the second part?
I tried having two different classes and running them with
scrapy crawl firstSpider
scrapy crawl secondSpider
but I don't think that approach is smart.
I read the documentation of scrapyd, but I don't know if it's a good fit for my case.
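As I understand the scrapyd docs, scheduling both spiders would look something like the sketch below (my assumptions: the project is deployed as "myproject" and scrapyd is listening on its default port 6800); each job then runs in its own process:

import requests

# Schedule both spiders as separate scrapyd jobs via the schedule.json
# endpoint; scrapyd runs them in parallel in separate processes.
# "myproject" is a placeholder for the deployed project name.
for spider in ("firstSpider", "secondSpider"):
    response = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "myproject", "spider": spider},
    )
    print(response.json())  # e.g. {"status": "ok", "jobid": "..."}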
I think what you are looking for is something like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
You can read more in the Scrapy docs under "Running multiple spiders in the same process".
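One thing to be aware of (my assumption about the snippet above, not something the answer spells out): CrawlerProcess() with no arguments uses default settings, so the pipelines and middlewares from your project's settings.py are not applied. Inside a Scrapy project you can pass the project settings explicitly:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Pass the project settings so ITEM_PIPELINES, middlewares, etc. take effect
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()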
Or you can run it like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument used by these spiders
process.start()
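For context on the query="dvh" argument: keyword arguments passed to process.crawl() are forwarded to the spider's constructor, and scrapy.Spider's default __init__ sets them as instance attributes. A minimal sketch of the receiving side (the spider name and URL here are made up for illustration):

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"

    def __init__(self, query=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.query = query  # receives query="dvh" from process.crawl()

    def start_requests(self):
        # hypothetical search URL, just to show the argument being used
        yield scrapy.Request("https://example.com/search?q=%s" % self.query)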
A better solution (if you have multiple spiders) is to discover the spiders dynamically and run them:
from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks

settings = project.get_project_settings()
runner = CrawlerRunner(settings)  # the runner was missing from the original snippet

@inlineCallbacks
def crawl():
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for my_spider in classes:
        # yielding each deferred waits for that crawl to finish
        yield runner.crawl(my_spider)
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until all crawling jobs are finished
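Note the difference from the CrawlerProcess approach: with CrawlerRunner and inlineCallbacks, each yield waits for the previous crawl to complete, so the spiders run one after another, whereas CrawlerProcess starts all crawlers in the same reactor and runs them in parallel.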
Second solution: because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution above should be converted to something like this:
from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)  # the process was missing from the original snippet
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name)
process.start()
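(As I read the Scrapy 1.4 deprecation, the same loader is also exposed as process.spider_loader, so building one by hand with SpiderLoader.from_settings() is optional; that is my reading of the release notes, not something this answer states.)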