Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website. They were trying to run the command:
scrapy crawl mininova.org -o scraped_data.json -t json
I don't quite understand what this means. It looks like scrapy turns out to be a separate program, and I don't think Python has a command called `crawl`. In the example, they have a paragraph of code which defines the classes `MininovaSpider` and `TorrentItem`. I don't know where these two classes should go. Do they go into the same file, and what is the name of this Python file?
TL;DR: see the self-contained minimum example script to run Scrapy below.
First of all, having a normal Scrapy project with a separate `scrapy.cfg`, `settings.py`, `pipelines.py`, `items.py`, a `spiders` package etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and a separation of concerns that keeps things organized, clear and testable.

If you are following the official Scrapy tutorial to create a project, you run web scraping via the special `scrapy` command-line tool:
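For example, from inside the project directory (the spider name here is just a placeholder):

```
scrapy crawl myspider -o scraped_data.json
```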
But Scrapy also provides an API to run crawling from a script. There are several key concepts that should be mentioned:

- `Settings` class - basically a key-value "container" which is initialized with default built-in values
- `Crawler` class - the main class that acts like a glue for all the different components involved in web scraping with Scrapy
- `reactor` - since Scrapy is built on top of the `twisted` asynchronous networking library, to start a crawler we need to put it inside the Twisted reactor, which is, in simple words, an event loop

Here is a basic and simplified process of running Scrapy from a script:
- create a `Settings` instance (or use `get_project_settings()` to use existing settings)
- instantiate `Crawler` with the `settings` instance passed in
- instantiate a spider (this is what it is all about eventually, right?)
- configure signals. This is an important step if you want to have post-processing logic, collect stats or, at least, ever finish crawling, since the twisted `reactor` needs to be stopped manually. The Scrapy docs suggest stopping the `reactor` in the `spider_closed` signal handler
- configure and start the crawler instance with the spider passed in
- optionally start logging
- start the reactor - this would block the script execution

All of these steps are put together in the sketch below.
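A minimal sketch of these steps, written against the older (pre-1.0) Scrapy API that they describe; the spider import path is illustrative:

```
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings

from myproject.spiders.dmoz import DmozSpider  # illustrative import; use your own spider

# create a Settings instance (or get_project_settings() for an existing project)
settings = Settings()

# instantiate a Crawler with the settings passed in
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals: stop the reactor once the spider is closed
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

# configure and start the crawler with the spider passed in
crawler.configure()
crawler.crawl(spider)
crawler.start()

# optionally start logging
log.start()

# start the reactor - this blocks until reactor.stop() is called
reactor.run()
```

(On Scrapy 1.0 and newer, the same flow is wrapped by `CrawlerProcess`/`CrawlerRunner`, but the steps are conceptually the same.)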
Here is an example self-contained script that uses a `DmozSpider` spider and involves item loaders with input and output processors, as well as an item pipeline:
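A sketch of such a script, again targeting the older Scrapy API; the XPaths, the dmoz.org URLs and names like `JsonWriterPipeline` are illustrative:

```
import json

from twisted.internet import reactor
from scrapy import Spider, Item, Field, log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class DmozItemLoader(ItemLoader):
    # input and output processors
    default_input_processor = MapCompose(lambda value: value.strip())
    default_output_processor = TakeFirst()

    desc_out = Join()


class JsonWriterPipeline(object):
    """Item pipeline that appends every scraped item to items.jl."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/']

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


def spider_closed(spider):
    # stop the reactor once the spider has finished
    reactor.stop()


# enable the pipeline defined in this very module
settings = Settings()
settings.set('ITEM_PIPELINES', {'__main__.JsonWriterPipeline': 100})

crawler = Crawler(settings)
crawler.signals.connect(spider_closed, signal=signals.spider_closed)

crawler.configure()
crawler.crawl(DmozSpider())
crawler.start()

log.start()
reactor.run()  # blocks until spider_closed() stops the reactor
```

Run it in the usual way (the file name is arbitrary):

```
python scrapy_script.py
```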
and observe items exported to `items.jl` with the help of the pipeline. The gist is available here (feel free to improve).
Notes:
If you define `settings` by instantiating a `Settings()` object, you'll get all the default Scrapy settings. But if you want to, for example, configure an existing pipeline, set a `DEPTH_LIMIT` or tweak any other setting, you need to either set it in the script via `settings.set()` (as demonstrated in the example):
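For instance, to enable a pipeline and limit the crawl depth (the values here are just examples):

```
settings = Settings()
settings.set('ITEM_PIPELINES', {'__main__.JsonWriterPipeline': 100})
settings.set('DEPTH_LIMIT', 2)
```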
or use an existing `settings.py` with all the custom settings preconfigured, by loading it via `get_project_settings()`.

Other useful links on the subject:
You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.
The tutorial implies that Scrapy is, in fact, a separate program.
Running the command `scrapy startproject tutorial` will create a folder called `tutorial` with several files already set up for you.

For example, in my case, the modules/packages `items`, `pipelines`, `settings` and `spiders` have been added to the root package `tutorial`.

The `TorrentItem` class would be placed inside `items.py`, and the `MininovaSpider` class would go inside the `spiders` folder.

Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:
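Judging by the command from the question, something along these lines:

```
scrapy crawl <website-name> -o <output-file> -t <output-type>
```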
Alternatively, if you want to run Scrapy without the overhead of creating a project directory, you can use the `runspider` command:
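For example (the file name is just whatever your spider file is called):

```
scrapy runspider my_spider.py
```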