Scrapy Pipeline - unhashable type list

I am trying to create a spider that fetches all the urls from one domain and create a record of the domain name and all the headers across the urls on this domain. This is a continuation of a previous question.

I managed to get help, and understand that I need to use Item pipeline in the scrapy framework to achieve this. I create a dict/hash in the items-pipeline where I store domain name and append all the headers.

The error I receive is: unhashable type 'list'

spider.py

class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = ['web.aitp.se']
    start_urls = ['http://web.aitp.se/']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(), callback='parse_item'),  
        )

    def parse_item(self, response):
        domain=response.url.split("/")[2] 
        xpath = HtmlXPathSelector(response)
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_value('domain',domain)
        loader.add_xpath('h1',("//h1/text()"))
        yield loader.load_item()

pipelines.py

# Define your item pipelines here
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    from scrapy.exceptions import DropItem
    from scrapy.http import Request
    from Prospecting.items import WebsiteItem
    from collections import defaultdict

class DomainPipeline(object):
    global Accumulator 
    Accumulator = defaultdict(list)

    def process_item(self, item, spider):
        Accumulator[ item['domain'] ].append( item['h1'] )

    def close_spider(spider):
        yield Accumulator.items()

I tried to break down the problem, and just read domains and headers from a csv-file and merge this into one record and this works fine.

from collections import defaultdict
Accumulator = defaultdict(list)
companies= open('test.csv','r')

for line in companies:

    fields=line.split(',')
    Accumulator[ fields[0] ].append(fields[1])

print Accumulator.items()

标签： python scrapy pipeline

1条回答

手持菜刀，她持情操

2楼-- · 2019-08-29 10:56

In python, a list cannot be used as key in a dict. The dict keys need to be hashable (which usually means that keys need to be immutable)

So, if there is any place where you are using lists, you can convert it into a tuple before adding to a dict. tuple(mylist) should be good enough to convert the list to a tuple.

0人赞添加讨论(0) 举报

Scrapy Pipeline - unhashable type list

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间