I want to fetch web pages under different domains, which means I have to use different spiders with the command "scrapy crawl myspider". However, I have to use different pipeline logic to put the data into the database, since the content of the pages differs. But every spider has to go through all of the pipelines defined in settings.py. Is there a more elegant way to use separate pipelines for each spider?
A more robust solution; I can't remember where I found it, but a Scrapy dev proposed it somewhere. Using this method lets you have a pipeline run on all spiders simply by not applying the wrapper to it. It also means you don't have to duplicate the logic of checking whether or not to use the pipeline.
Wrapper:
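The original answer's code did not survive extraction, so the following is a minimal sketch of the decorator pattern it describes. The name check_spider_pipeline and the spider-level pipeline attribute are conventions of this pattern, not Scrapy built-ins:

```python
import functools


def check_spider_pipeline(process_item_method):
    """Only run the wrapped process_item if this pipeline class is listed
    in the spider's `pipeline` attribute; otherwise pass the item through."""
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        if self.__class__ in getattr(spider, 'pipeline', set()):
            spider.logger.debug('%s: executing pipeline step',
                                self.__class__.__name__)
            return process_item_method(self, item, spider)
        spider.logger.debug('%s: skipping pipeline step',
                            self.__class__.__name__)
        return item
    return wrapper
```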
Usage:
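A sketch of a pipeline that opts in to the check; the class name is hypothetical, and the pipeline is still registered in ITEM_PIPELINES in settings.py as usual:

```python
class SaveToDatabasePipeline:
    @check_spider_pipeline
    def process_item(self, item, spider):
        # database logic specific to the spiders that enable this pipeline
        return item
```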
Spider usage:
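A sketch of the spider side, which lists the pipeline classes that should actually run on its items (the module path and names are assumptions for illustration):

```python
import scrapy

from myproject.pipelines import SaveToDatabasePipeline  # hypothetical path


class MySpider(scrapy.Spider):
    name = 'myspider'
    # only these pipeline classes will process items from this spider
    pipeline = {SaveToDatabasePipeline}
```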
The ITEM_PIPELINES setting is defined globally for all spiders in the project during engine start. It cannot be changed per spider on the fly. Here are some options to consider:
- Change the code of your pipelines. Skip or continue processing items returned by spiders in the process_item method of your pipeline, based on the spider name (see the first sketch below).
- Change the way you start crawling. Do it from a script, based on a spider name passed as a parameter, and override your ITEM_PIPELINES setting before calling crawler.configure() (see the second sketch below).
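The inline example was lost in extraction; a minimal sketch of the first option, filtering on spider.name inside process_item, might look like this (spider and pipeline names are placeholders):

```python
class Spider1Pipeline:
    def process_item(self, item, spider):
        # pass items through untouched for spiders this pipeline is not meant for
        if spider.name not in ('spider1', 'spider3'):
            return item
        # ... processing specific to spider1 / spider3 items ...
        return item
```

For the second option, crawler.configure() comes from older Scrapy versions; with a current release the same idea, choosing pipelines per spider name before starting the crawl from a script, could be sketched roughly like this (module paths are assumptions):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical mapping of spider names to the pipelines they should use
PIPELINES_BY_SPIDER = {
    'spider1': {'myproject.pipelines.Spider1Pipeline': 300},
    'spider2': {'myproject.pipelines.Spider2Pipeline': 300},
}


def run(spider_name):
    settings = get_project_settings()
    settings.set('ITEM_PIPELINES', PIPELINES_BY_SPIDER[spider_name])
    process = CrawlerProcess(settings)
    process.crawl(spider_name)
    process.start()  # blocks until the crawl finishes
```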
Hope that helps.
A slightly better version of the above is as follows. It is better because it lets you selectively turn pipelines on for different spiders more easily than hard-coding checks like "not in ['spider1', 'spider2']" inside the pipeline, as in the answer above.
In your spider class, add:
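The answer's code block was lost in extraction; the idea is to give the spider one boolean attribute per pipeline it should pass through. A sketch with hypothetical attribute names:

```python
import scrapy


class Spider1(scrapy.Spider):
    name = 'spider1'
    # one flag per pipeline this spider should actually use
    use_database_pipeline = True
    use_image_pipeline = False
```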
Then in each pipeline, you can use getattr as magic. Add:
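A sketch of that getattr check (the attribute name must match what the spider defines; the False default means spiders that don't declare the flag are skipped):

```python
class DatabasePipeline:
    def process_item(self, item, spider):
        # skip spiders that have not opted in to this pipeline
        if not getattr(spider, 'use_database_pipeline', False):
            return item
        # ... database-specific processing here ...
        return item
```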