I have been trying to understand the concept of using BaseSpider and CrawlSpider in web scraping. I have read the docs, but there is no mention of BaseSpider. It would be really helpful if someone could explain the differences between BaseSpider and CrawlSpider.
Answer 1:
BaseSpider is something that existed before and is now deprecated (since 0.22) - use scrapy.Spider instead:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"  # every spider needs a unique name
    # ...
scrapy.Spider is the simplest spider that would, basically, visit the URLs defined in start_urls or returned by start_requests().
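
For illustration, a minimal complete Spider might look like the sketch below (the spider name, URL, and CSS selector are placeholders I made up, not part of the original answer):

import scrapy

class TitleSpider(scrapy.Spider):
    name = "titles"  # hypothetical name
    start_urls = ["http://example.com/"]  # placeholder URL

    def parse(self, response):
        # parse() is called with the response for each URL in start_urls
        yield {"url": response.url, "title": response.css("title::text").get()}

Note that a plain Spider only requests what you explicitly tell it to: it does not follow any links on its own.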
Use CrawlSpider when you need a "crawling" behavior - extracting the links and following them:
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.
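
As a rough sketch of that rule-based crawling (the domain, the /items/ URL pattern, and the selector are assumptions for the example, not from the answer):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteCrawler(CrawlSpider):
    name = "sitecrawler"  # hypothetical name
    allowed_domains = ["example.com"]  # placeholder domain
    start_urls = ["http://example.com/"]

    # Extract every link whose URL matches /items/, follow it,
    # and run parse_item on each response
    rules = (
        Rule(LinkExtractor(allow=r"/items/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

One caveat: a CrawlSpider should not override parse(), because CrawlSpider uses that method internally to apply its rules - hence the parse_item callback name here.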