I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. Since I need to run multiple spiders from different projects from the same script (this will be a Django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script is no longer able to get the project's custom settings in order to pipeline the scraped results into the database tables. Here is the code from the Scrapy docs I'm using to run the spider from a script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spiderCrawl():
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()
Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or do some additions need to be made to this code? Below is the code for the script that runs the spiders, thanks.
from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider
tc_spider.spiderCrawl()
vs_spider.spiderCrawl()
It should work; can you share your Scrapy log file?
Edit: your approach will not work, because when you execute the script it will look for your default settings in the directory it is run from; since there is no scrapy.cfg in the parent directory, get_project_settings() cannot locate your project's settings.py.
Solution 1: create a scrapy.cfg file inside that directory (outside the project folders) and give it a path to the valid settings.py file.
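For example, a minimal scrapy.cfg in the parent directory could look like this (the module path below assumes the ticket_city_scraper layout from the question; adjust it to your own settings module):

```ini
[settings]
default = ticket_city_scraper.ticket_city_scraper.settings
```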
Solution 2: make your parent directory a package, so that absolute paths will not be required and you can use relative ones, i.e.
python -m cron.project1
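A sketch of Solution 2, using the cron/project1 names from the command above (the directory names are placeholders, not a prescribed layout):

```shell
# Turn the parent directory and the project folder into packages by adding
# __init__.py files, so the script can then be run as `python -m cron.project1`.
mkdir -p cron/project1
touch cron/__init__.py
touch cron/project1/__init__.py
ls cron/__init__.py cron/project1/__init__.py
```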
Solution 3: also you can try something like this. Leave the script where it is, inside the project directory, where it is working, and create an sh file that runs it. Now you can execute the spiders via this sh file when requested by Django.
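A sketch of such an sh file (the project paths and spider names below are placeholders, not from the original post); each spider is crawled from inside its own project directory, so that project's scrapy.cfg is found normally:

```shell
# Write the wrapper script; Django can then invoke it with subprocess.
cat > run_spiders.sh <<'EOF'
#!/bin/sh
cd /path/to/ticket_city_scraper && scrapy crawl tc
cd /path/to/vividseats_scraper && scrapy crawl vs
EOF
chmod +x run_spiders.sh
# Syntax-check the generated script without executing it
sh -n run_spiders.sh && echo "syntax OK"
```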
Thanks to some of the answers already provided here, I realised scrapy wasn't actually importing the settings.py file. This is how I fixed it.
TLDR: Make sure you set the SCRAPY_SETTINGS_MODULE environment variable to point to your actual settings.py module. I'm doing this in the __init__() func of Scraper.
Consider a project with the following structure.
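Based on the description that follows, the layout is roughly this (reconstructed; a real scrapy startproject also generates items.py, pipelines.py, etc.):

```text
my_project/
├── main.py
└── scraper/
    ├── run_scraper.py
    ├── scrapy.cfg
    └── scraper/
        ├── __init__.py
        ├── settings.py
        └── spiders/
            ├── __init__.py
            └── quotes_spider.py
```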
Basically, the command
scrapy startproject scraper
was executed in the my_project folder, and I've added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.

My main file:
My run_scraper.py file:

Also, note that the settings might require a look-over, since the path needs to be according to the root folder (my_project, not scraper). So in my case:
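A sketch of what that run_scraper.py can look like (the spider module and class names are assumed; the Scrapy imports are deferred until after the environment variable is set, so that get_project_settings() picks up the right module):

```python
import os


class Scraper:
    def __init__(self):
        # The settings module path is written relative to the root folder
        # (my_project), not to the inner scraper folder.
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scraper.scraper.settings')

    def run_spiders(self):
        # Import Scrapy only after SCRAPY_SETTINGS_MODULE is set.
        from scrapy.crawler import CrawlerProcess
        from scrapy.utils.project import get_project_settings
        from scraper.scraper.spiders.quotes_spider import QuotesSpider

        process = CrawlerProcess(get_project_settings())
        process.crawl(QuotesSpider)
        process.start()
```

main.py then only needs to import Scraper, instantiate it, and call run_spiders().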
And repeat for all the settings variables you have!
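Concretely, the module-path settings might end up looking like this (an excerpt with assumed names; every module path is written relative to the root folder, my_project, not to the inner scraper package):

```python
# scraper/scraper/settings.py (excerpt)
BOT_NAME = 'scraper'
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
```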
This could happen because you are no longer "inside" a Scrapy project, so it doesn't know how to get the settings with get_project_settings(). You can also specify the settings as a dictionary, as in the example here:
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
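A minimal sketch of that approach (the Scrapy import is deferred into the function so the snippet stands alone; the spider class and any pipeline paths are whatever your own project defines):

```python
def crawl_with_dict_settings(spider_cls):
    """Run one spider with its settings passed inline as a dict,
    with no dependency on scrapy.cfg or get_project_settings()."""
    from scrapy.crawler import CrawlerProcess  # deferred import

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        # Add e.g. 'ITEM_PIPELINES' here to keep your database pipeline.
    })
    process.crawl(spider_cls)
    process.start()
```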
I have used this code to solve the problem: