I'm trying to limit Scrapy to a particular XPath location for following links. The XPath is correct (according to the XPath Helper plugin for Chrome), but when I run my CrawlSpider I get a syntax error at my Rule.
My Spider code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()

class BassSpider(CrawlSpider):
    name = "bass"
    allowed_domains = ["talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126"]
    rules = [Rule(SgmlLinkExtractor(allow=['/f126/index*']), callback='parse_item', follow=True, restrict_xpaths=('//a[starts-with(@title,"Next ")]')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        ads = hxs.select('//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div')
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('a/text()').extract()
            item['link'] = ad.select('a/@href').extract()
            items.append(item)
        return items
So inside the rule, the XPath '//a[starts-with(@title,"Next ")]' is returning an error and I'm not sure why, since the XPath itself is valid. I'm simply trying to get the spider to crawl each "Next Page" link. Can anyone help me out? Please let me know if you need any other parts of my code.
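The XPath is not the issue; the syntax of the rule as a whole is incorrect. The Rule( call is never closed: the last ) only closes the parentheses around the restrict_xpaths value, so the ] that ends the list arrives while a parenthesis is still open. Closing the Rule( call fixes the syntax error, but the resulting rule should still be checked to make sure that it does what is required. In particular, restrict_xpaths is an argument of SgmlLinkExtractor, not of Rule.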
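A minimal sketch for checking a candidate rule without running the spider, using only the standard-library ast module, so Scrapy does not need to be installed. BROKEN_RULE is the line from the question; FIXED_RULE is an assumed correction (restrict_xpaths moved inside SgmlLinkExtractor and every parenthesis closed), not necessarily the exact rule the spider should end up with. The helper name is_valid_syntax is made up for this illustration.

```python
import ast

# The rule exactly as posted in the question: Rule( is still open when ] appears.
BROKEN_RULE = (
    "rules = [Rule(SgmlLinkExtractor(allow=['/f126/index*']), "
    "callback='parse_item', follow=True, "
    "restrict_xpaths=('//a[starts-with(@title,\"Next \")]')]"
)

# Assumed correction: restrict_xpaths is passed to the link extractor
# (as a one-element tuple) and the Rule( call is closed before the ].
FIXED_RULE = (
    "rules = [Rule(SgmlLinkExtractor(allow=['/f126/index*'], "
    "restrict_xpaths=('//a[starts-with(@title,\"Next \")]',)), "
    "callback='parse_item', follow=True)]"
)


def is_valid_syntax(source):
    """Return True if source parses as Python. Names are not resolved,
    so this checks syntax only, not whether the rule behaves as intended."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


broken_ok = is_valid_syntax(BROKEN_RULE)
fixed_ok = is_valid_syntax(FIXED_RULE)
print("question's rule parses:", broken_ok)
print("assumed fix parses:", fixed_ok)
```

This only confirms that the brackets balance. Whether the spider then follows just the "Next Page" links still has to be checked against the crawl log.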
As a general point, posting the actual error in a question is highly recommended since the perception of the error and the actual error may well differ.