How to properly use Rules, restrict_xpaths to craw

Posted 2019-03-16 23:28

I am trying to write a crawl spider that crawls the RSS feeds of a website and then parses the meta tags of each article.

The first RSS page lists the RSS categories. I managed to extract the links because each <a> tag sits inside a <td class="xmlLink"> tag. It looks like this:

           <td class="xmlLink">
             <a href="">subject1</a>
           <td class="xmlLink">
             <a href="">subject2</a>

Clicking one of those links takes you to the articles for that RSS category, which look like this:

   <li class="regularitem">
    <h4 class="itemtitle">
        <a href="">article1</a>
  <li class="regularitem">
     <h4 class="itemtitle">
        <a href="">article2</a>

As you can see, I can get the link with XPath again if I use the <h4 class="itemtitle"> tag. I want my crawler to follow the link inside that tag and parse the meta tags of the article for me.
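To make the intended link selection concrete, here is a minimal standard-library sketch (not Scrapy) of the idea: pick a region by XPath, then collect the href of every <a> inside it. The markup is a hypothetical, well-formed stand-in for the article list, and the example.com URLs and the links_in_region helper are assumptions made for illustration, since the real hrefs are elided above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the article list page.
# The example.com URLs are placeholders; the real hrefs are not shown in the post.
ARTICLE_LIST = """
<ul>
  <li class="regularitem">
    <h4 class="itemtitle"><a href="http://example.com/article1">article1</a></h4>
  </li>
  <li class="regularitem">
    <h4 class="itemtitle"><a href="http://example.com/article2">article2</a></h4>
  </li>
  <li class="other"><a href="http://example.com/ignored">not an article</a></li>
</ul>
"""


def links_in_region(markup, region_path):
    """Return the hrefs of all <a> elements found under elements matching region_path."""
    root = ET.fromstring(markup.strip())
    hrefs = []
    for region in root.findall(region_path):
        # Only anchors inside the matched region are considered.
        for anchor in region.iter('a'):
            href = anchor.get('href')
            if href:
                hrefs.append(href)
    return hrefs


# Restrict link collection to the <h4 class="itemtitle"> regions.
article_links = links_in_region(ARTICLE_LIST, ".//h4[@class='itemtitle']")
print(article_links)
```

The link outside the <h4 class="itemtitle"> region is skipped, which is the behaviour I am hoping to get from restricting the link extractor to that tag.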

Here is my crawler code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = [''] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles')]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
           item = exampleItem()
           item['link'] = response.url
           item['meta_value'] = m.select('@content').extract()
        return items

However this is the output when I run the crawler:

DEBUG: Crawled (200) <GET http://> (referer:
DEBUG: Crawled (200) <GET http://> (referer:

What am I doing wrong here? I've been reading the documentation over and over again but I feel like I keep overlooking something. Any help would be appreciated.

EDIT: Added items.append(item). I had forgotten it in the original post.

EDIT: I've tried this as well, and it produced the same output:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = [''] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*',], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')),follow=True),]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_link)

    def parse_link(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_again)

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
        return items

Reply #2 · 2019-03-16 23:41

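You are returning an empty items list: each item is created inside the loop but never appended to items.
Either append every item to items before returning it, or yield each item directly inside the loop.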
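
A minimal sketch of both options, written so that it runs without Scrapy. Plain dicts stand in for exampleItem and a list of (name, content) pairs stands in for the selector results; those stand-ins, the function names, and the sample values are assumptions for illustration only.

```python
def parse_articles_with_list(url, meta_tags):
    """Collect one item per meta tag and return them all as a list."""
    items = []
    for name, content in meta_tags:
        item = {'link': url, 'meta_name': name, 'meta_value': content}
        items.append(item)  # without this line the returned list stays empty
    return items


def parse_articles_with_yield(url, meta_tags):
    """Yield one item per meta tag; no intermediate list is needed."""
    for name, content in meta_tags:
        yield {'link': url, 'meta_name': name, 'meta_value': content}


# Hypothetical sample data standing in for the <meta> tags of one article.
sample_url = 'http://example.com/article1'
sample_meta = [('description', 'An example article'), ('keywords', 'rss, scrapy')]

list_result = parse_articles_with_list(sample_url, sample_meta)
yield_result = list(parse_articles_with_yield(sample_url, sample_meta))
print(list_result == yield_result)
```

In a spider callback either form should work, since a callback may return a list of items or be a generator that yields them.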
