I'm unable to crawl the whole website; Scrapy only crawls the surface, and I want it to crawl deeper. I've been googling for the last 5-6 hours with no luck. My code is below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log


class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
Please help!
Thanks, Abhiram
Rules short-circuit, meaning that the first rule a link satisfies is the only rule that gets applied, so your second Rule (the one with the callback) will never be called.
Change your rules to this:
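(The corrected rules aren't shown in the original answer; a minimal sketch of the idea is to merge following and the callback into a single rule, reusing the SgmlLinkExtractor from the question:)

    rules = [
        # One rule that both follows extracted links and passes each
        # response to parse_item, so deeper pages are crawled AND parsed.
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    ]

Keeping follow=True on the rule that carries the callback means every matched page is both scraped and used as a source of further links.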
When parsing the start_urls, deeper URLs can be extracted from the href attributes of anchor tags. The deeper requests can then be yielded from the parse() function. Here is a simple example; the most important source code is shown below:
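(The example code was not carried over here; the following is a minimal sketch of the approach, written against the same old-style Scrapy API used in the question. The spider name, domain, and URLs are placeholders.)

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    import urlparse


    class DeepExampleSpider(BaseSpider):
        name = "example.com"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            self.log('A response from %s just arrived!' % response.url)
            hxs = HtmlXPathSelector(response)
            # Pull the href attribute from every anchor tag on the page.
            for href in hxs.select('//a/@href').extract():
                # Turn relative links into absolute URLs, then yield a deeper
                # request whose response is parsed by this same method.
                url = urlparse.urljoin(response.url, href)
                yield Request(url, callback=self.parse)

Scrapy's built-in duplicate request filter stops the spider from re-requesting URLs it has already seen, so this kind of recursive crawl terminates on its own.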