How to scrape an XML feed with XMLFeedSpider

I am trying to scrape an XML file with the format below.

file_sample.xml:

<rss version="2.0">
 <channel>
   <item>
       <title>SENIOR BUDGET ANALYST (new)</title>
       <link>https://hr.example.org/psp/hrapp&SeqId=1</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All Open Jobs</category>
   </item>
   <item>
       <title>BUDGET ANALYST (healthcare)</title>
       <link>https://hr.example.org/psp/hrapp&SeqId=2</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All category</category>
   </item>
 </channel>
</rss>

Below is my spider.py code:

class TestSpider(XMLFeedSpider):
    name = "testproject"
    allowed_domains = {"www.example.com"}
    start_urls = [
        "https://www.example.com/hrapp/rss/careers_jo_rss.xml"
        ]
    iterator = 'iternodes'
    itertag = 'channel'


    def parse_node(self, response, node):
        title = node.select('item/title/text()').extract()
        link  = node.select('item/link/text()').extract()
        pubdate  = node.select('item/pubDate/text()').extract()
        category  = node.select('item/category/text()').extract()
        item = TestprojectItem()
        item['title'] = title
        item['link'] = link
        item['pubdate'] = pubdate
        item['category'] = category
        return item

Result:

2012-07-25 13:24:14+0530 [testproject] DEBUG: Scraped from <200 https://hr.templehealth.org/hrapp/rss/careers_jo_rss.xml>
    {'title': [u'SENIOR BUDGET ANALYST (hospital/healthcare)',
               u'BUDGET ANALYST'],
     'link': [u'https://hr.example.org/psp/hrapp&SeqId=1',
               u'https://hr.example.org/psp/hrapp&SeqId=2'],
     'pubdate': [u'Wed, 18 Jul 2012 04:00:00 GMT',
               u'Wed, 18 Jul 2012 04:00:00 GMT'],
     'category': [u'All Open Jobs',
               u'All category']
      }

As you can see from the result above, the values from all matching tags are combined into a single list per field. I want them grouped by their individual item tag instead, as shown below, the same way we do for HTML scraping.

    {'title': u'SENIOR BUDGET ANALYST (hospital/healthcare)',
     'link': u'https://hr.example.org/psp/hrapp&SeqId=1',
     'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT',
     'category': u'All Open Jobs'
      }
    {'title': u'BUDGET ANALYST',
     'link': u'https://hr.example.org/psp/hrapp&SeqId=2',
     'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT',
     'category': u'All category'
      }

How can we scrape the XML tag data separately for each main tag, such as the item tag above?

Thanks in advance.

Tags: python xml scrapy web-crawler

3 Answers

Luminary・发光体

Reply #2 · 2019-06-05 06:32

Try changing your itertag from itertag = 'channel' to itertag = 'item'.


老娘就宠你

Reply #3 · 2019-06-05 06:39

I recommend the use of feedparser:

feedparser.parse(url)

results in

{'bozo': 1,
 'bozo_exception': xml.sax._exceptions.SAXParseException("EntityRef: expecting ';'\n"),
 'encoding': u'utf-8',
 'entries': [{'link': u'https://hr.example.org/psp/hrapp&SeqId=1',
   'links': [{'href': u'https://hr.example.org/psp/hrapp&SeqId=1',
     'rel': u'alternate',
     'type': u'text/html'}],
   'tags': [{'label': None, 'scheme': None, 'term': u'All Open Jobs'}],
   'title': u'SENIOR BUDGET ANALYST (new)',
   'title_detail': {'base': u'',
    'language': None,
    'type': u'text/plain',
    'value': u'SENIOR BUDGET ANALYST (new)'},
   'updated': u'Wed, 18 Jul 2012 04:00:00 GMT',
   'updated_parsed': time.struct_time(tm_year=2012, tm_mon=7, tm_mday=18, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=200, tm_isdst=0)},
  {'link': u'https://hr.example.org/psp/hrapp&SeqId=2',
   'links': [{'href': u'https://hr.example.org/psp/hrapp&SeqId=2',
     'rel': u'alternate',
     'type': u'text/html'}],
   'tags': [{'label': None, 'scheme': None, 'term': u'All category'}],
   'title': u'BUDGET ANALYST (healthcare)',
   'title_detail': {'base': u'',
    'language': None,
    'type': u'text/plain',
    'value': u'BUDGET ANALYST (healthcare)'},
   'updated': u'Wed, 18 Jul 2012 04:00:00 GMT',
   'updated_parsed': time.struct_time(tm_year=2012, tm_mon=7, tm_mday=18, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=200, tm_isdst=0)}],
 'feed': {},
 'namespaces': {},
 'version': u'rss20'}
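
The entries list above already holds one dictionary per item, so it can be mapped onto the four fields from the question. What follows is a minimal sketch, assuming the key names shown in that output (title, link, updated, and the term of the first tag). The helper name entries_to_items is hypothetical, and the sample list is a trimmed copy of the output above rather than a live feedparser.parse() call.

```python
def entries_to_items(entries):
    """Flatten feedparser-style entries into one dict per feed item."""
    items = []
    for entry in entries:
        tags = entry.get("tags") or []
        items.append({
            "title": entry.get("title"),
            "link": entry.get("link"),
            # The output above exposes <pubDate> under the 'updated' key.
            "pubdate": entry.get("updated"),
            # Use the first tag's term as the category, if there is one.
            "category": tags[0]["term"] if tags else None,
        })
    return items


# Trimmed copy of the 'entries' value shown above; plain dicts stand in
# for the dict-like objects that feedparser returns.
sample_entries = [
    {
        "title": "SENIOR BUDGET ANALYST (new)",
        "link": "https://hr.example.org/psp/hrapp&SeqId=1",
        "updated": "Wed, 18 Jul 2012 04:00:00 GMT",
        "tags": [{"label": None, "scheme": None, "term": "All Open Jobs"}],
    },
    {
        "title": "BUDGET ANALYST (healthcare)",
        "link": "https://hr.example.org/psp/hrapp&SeqId=2",
        "updated": "Wed, 18 Jul 2012 04:00:00 GMT",
        "tags": [{"label": None, "scheme": None, "term": "All category"}],
    },
]

items = entries_to_items(sample_entries)
for item in items:
    print(item)
```

With feedparser installed, the same helper could be fed feedparser.parse(url)['entries'] directly. Depending on the feedparser version, the publication date may appear under published rather than updated, so that key is worth checking against your own output.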


萌系小妹纸

Reply #4 · 2019-06-05 06:43

Just set itertag = 'item'.

According to the documentation for the parse_node method, it is called once for each node matching the provided tag name (itertag). In your case that tag is 'item' (a child node of 'channel').
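
To see why the node handed to parse_node matters, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy, so it runs without a crawl. It is a stand-in for illustration, not the spider itself. The feed is the one from the question with the & in each link escaped as &amp; (an assumption, so that a strict parser accepts it), and the helper names fields_from_channel and fields_per_item are hypothetical.

```python
import xml.etree.ElementTree as ET

# Feed from the question; '&' is written as '&amp;' here (an assumption)
# so that a strict XML parser accepts the document.
FEED = """<rss version="2.0">
 <channel>
   <item>
       <title>SENIOR BUDGET ANALYST (new)</title>
       <link>https://hr.example.org/psp/hrapp&amp;SeqId=1</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All Open Jobs</category>
   </item>
   <item>
       <title>BUDGET ANALYST (healthcare)</title>
       <link>https://hr.example.org/psp/hrapp&amp;SeqId=2</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All category</category>
   </item>
 </channel>
</rss>"""

# Output field name -> child tag name inside <item>.
FIELDS = {"title": "title", "link": "link",
          "pubdate": "pubDate", "category": "category"}


def fields_from_channel(root):
    """Mimic itertag = 'channel': a single node, so every path of the
    form item/<tag> matches the children of all items at once."""
    channel = root.find("channel")
    return {key: [el.text for el in channel.findall("item/" + tag)]
            for key, tag in FIELDS.items()}


def fields_per_item(root):
    """Mimic itertag = 'item': one record per <item>, with paths taken
    relative to that item."""
    return [{key: item.findtext(tag) for key, tag in FIELDS.items()}
            for item in root.iter("item")]


root = ET.fromstring(FEED)
combined = fields_from_channel(root)   # one dict holding lists
separate = fields_per_item(root)       # one dict per item

print(combined["title"])
for record in separate:
    print(record)
```

In the spider, the equivalent change is itertag = 'item' together with XPaths relative to the item, for example node.select('title/text()') rather than node.select('item/title/text()'), because the node passed to parse_node is then the item element itself.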
