Scrapy Pipeline - unhashable type list

Posted 2019-08-29 10:27

I am trying to create a spider that fetches all the URLs from one domain and creates a record of the domain name and all the headers across the URLs on that domain. This is a continuation of a previous question.

With some help, I understood that I need to use an item pipeline in the Scrapy framework to achieve this. In the pipeline I create a dict, keyed by domain name, and append all the headers to the entry for that domain.

The error I receive is: TypeError: unhashable type: 'list'

spider.py

class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = ['web.aitp.se']
    start_urls = ['http://web.aitp.se/']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(), callback='parse_item'),  
        )

    def parse_item(self, response):
        domain=response.url.split("/")[2] 
        xpath = HtmlXPathSelector(response)
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_value('domain',domain)
        loader.add_xpath('h1',("//h1/text()"))
        yield loader.load_item()

pipelines.py

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from scrapy.http import Request
from Prospecting.items import WebsiteItem
from collections import defaultdict

class DomainPipeline(object):
    global Accumulator 
    Accumulator = defaultdict(list)

    def process_item(self, item, spider):
        Accumulator[ item['domain'] ].append( item['h1'] )

    def close_spider(spider):
        yield Accumulator.items()

To isolate the problem, I read domains and headers from a CSV file instead and merged them into one record per domain. This works fine.

from collections import defaultdict
Accumulator = defaultdict(list)
companies= open('test.csv','r')

for line in companies:

    fields=line.split(',')
    Accumulator[ fields[0] ].append(fields[1])

print Accumulator.items()

1 Answer
手持菜刀,她持情操
#2 · 2019-08-29 10:56

In Python, a list cannot be used as a key in a dict. Dict keys must be hashable, which in practice usually means they must be immutable.
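A minimal sketch that reproduces the error outside Scrapy may make this concrete. The value of `list_key` is a hypothetical stand-in for a field that arrives as a list; it is not taken from the actual crawl.

```python
from collections import defaultdict

accumulator = defaultdict(list)

# Hypothetical value: a domain field that is a list rather than a string
list_key = ['web.aitp.se']

try:
    # Looking up a list as a dict key requires hashing it, which fails
    accumulator[list_key].append('Some heading')
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# The message mentions "unhashable type"; exact wording varies by Python version
print(error_message)

# The same value as a tuple or a plain string is hashable and works as a key
accumulator[tuple(list_key)].append('Some heading')
accumulator[list_key[0]].append('Some heading')
```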

So wherever you are using a list as a key, convert it to a tuple before adding it to the dict. tuple(mylist) is enough to do the conversion.
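Applied to the accumulator from the question, the conversion could look like the sketch below. It assumes every item field arrives as a list, which is what Scrapy item loaders return unless an output processor such as TakeFirst is configured. The helper `as_key` and the sample items are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def as_key(value):
    """Return a hashable dict key: lists become tuples, other values pass through."""
    return tuple(value) if isinstance(value, list) else value


accumulator = defaultdict(list)

# Hypothetical items, shaped as if every field were collected into a list
items = [
    {'domain': ['web.aitp.se'], 'h1': ['Welcome']},
    {'domain': ['web.aitp.se'], 'h1': ['About us', 'Contact']},
]

for item in items:
    # extend() keeps one flat list of headers per domain
    accumulator[as_key(item['domain'])].extend(item['h1'])

print(dict(accumulator))
# {('web.aitp.se',): ['Welcome', 'About us', 'Contact']}
```

If the key should be a plain string rather than a one-element tuple, taking the first element of the list (for example item['domain'][0]) is an alternative under the same assumption.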
