Crawling local files with Scrapy without an active project

Posted 2019-07-29 12:46

Question:

Is it possible to crawl local files with Scrapy 0.18.4 without having an active project? I've seen this answer and it looks promising, but to use the crawl command you need a project.

Alternatively, is there an easy/minimalist way to set up a project for an existing spider? I have my spider, pipelines, middleware, and items defined in one Python file. I've created a scrapy.cfg file with only the project name. This lets me use crawl, but since I don't have a spiders folder Scrapy can't find my spider. Can I point Scrapy to the right directory, or do I need to split my items, spider, etc. up into separate files?
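To make that concrete, here is roughly the layout I have in mind, with everything except the config and settings in a single module (all names below are placeholders, and SPIDER_MODULES is just my guess at how to point Scrapy at that module):

    # Directory layout (flat, no spiders/ package):
    #   scrapy.cfg
    #   settings.py
    #   my_spider.py   <- spider, items, pipelines, middleware all in here

    # scrapy.cfg
    # [settings]
    # default = settings

    # settings.py - hypothetical minimal settings module
    BOT_NAME = 'myproject'
    # SPIDER_MODULES lists the Python modules Scrapy searches for spider classes,
    # so it could point at the single my_spider module instead of a spiders package
    SPIDER_MODULES = ['my_spider']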

[edit] I forgot to say that I'm running the spider using Crawler.crawl(my_spider) - ideally I'd still like to be able to run the spider like that, but I can run it in a subprocess from my script if that's not possible.

[edit 2] It turns out the suggestion in the answer I linked does work: http://localhost:8000 can be used as a start_url, so there's no need for a project.
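For anyone finding this later, a rough sketch of what I ended up with (file and spider names are made up, and it assumes the 0.18-era BaseSpider API): serve the directory of local files over HTTP, e.g. with python -m SimpleHTTPServer 8000, then use that server's URLs as start_urls.

    # local_spider.py - hypothetical example using Scrapy 0.18-era APIs.
    # First serve the directory containing the local files:
    #     python -m SimpleHTTPServer 8000
    from scrapy.spider import BaseSpider

    class LocalFileSpider(BaseSpider):
        name = "local_files"
        # the locally served pages act as ordinary start URLs
        start_urls = ["http://localhost:8000/page.html"]

        def parse(self, response):
            # minimal callback: just confirm the page was fetched
            self.log("Crawled %s (%d bytes)" % (response.url, len(response.body)))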

Answer 1:

As an option, you can run Scrapy from a script; here is a self-contained example script and an overview of the approach used.
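The gist is something along these lines (a sketch only - MySpider and my_spider are placeholders, and it assumes the 0.18-era Crawler API with configure(), Crawler.signals and scrapy.log; check the exact names against your installed version):

    # run_spider.py - rough sketch of running a spider from a script
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals

    from my_spider import MySpider  # placeholder module holding the spider

    spider = MySpider()
    crawler = Crawler(Settings())
    # stop the Twisted reactor once the spider finishes
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks here until the crawl ends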

This doesn't mean you have to put everything in one file. You can still have spider.py, items.py, pipelines.py - just import them correctly in the script you start crawling from.
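For instance, a runner script next to those files might look roughly like this (names are hypothetical, and it assumes the old CrawlerSettings.overrides mechanism and the pre-0.20 list form of ITEM_PIPELINES):

    # crawl.py - hypothetical layout: spider.py, items.py and pipelines.py
    # sit in the same directory as this script and are imported normally
    from scrapy.crawler import Crawler
    from scrapy.settings import CrawlerSettings

    from spider import MySpider   # spider class defined in spider.py
    import pipelines              # pipeline classes defined in pipelines.py

    settings = CrawlerSettings()
    # register the pipeline by dotted path (old Scrapy expects a list here)
    settings.overrides['ITEM_PIPELINES'] = ['pipelines.MyPipeline']

    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(MySpider())
    crawler.start()
    # ...then start logging and run the Twisted reactor as in the snippet above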