Python: How to update value of key value pair in n

2019-02-26 04:53发布

i am trying to make an inversed document index, therefore i need to know from all unique words in a collection in which doc they occur and how often.

i have used this answer in order two create a nested dictionary. The provided solution works fine, with one problem though.

First i open the file and make a list of unique words. These unique words i than want to compare with the original file. When there is a match, the frequency counter should be updated and its value be stored in the two dimensional array.

output should eventually look like this:

word1, {doc1 : freq}, {doc2 : freq} <br>
word2, {doc1 : freq}, {doc2 : freq}, {doc3:freq}
etc....

Problem is that i cannot update the dictionary variable. When trying to do so i get the error:

  File "scriptV3.py", line 45, in main
    freq = dictionary[keyword][filename] + 1
TypeError: unsupported operand type(s) for +: 'AutoVivification' and 'int'

I think i need to cast in some way the instance of AutoVivification to int....

How to go?

thanks in advance

my code:

#!/usr/bin/env python 
# encoding: utf-8

import sys
import os
import re
import glob
import string
import sets

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

def main():
    pad = 'temp/'
    dictionary  = AutoVivification()
    docID = 0
    for files in glob.glob( os.path.join(pad, '*.html') ):  #for all files in specified folder:
        docID = docID + 1
        filename = "doc_"+str(docID)
        text = open(files, 'r').read()                      #returns content of file as string
        text = extract(text, '<pre>', '</pre>')             #call extract function to extract text from within <pre> tags
        text = text.lower()                                 #all words to lowercase
        exclude = set(string.punctuation)                   #sets list of all punctuation characters
        text = ''.join(char for char in text if char not in exclude) # use created exclude list to remove characters from files
        text = text.split()                                 #creates list (array) from string
        uniques = set(text)                                 #make list unique (is dat handig? we moeten nog tellen)

        for keyword in uniques:                             #For every unique word do   
            for word in text:                               #for every word in doc:
                if (word == keyword and dictionary[keyword][filename] is not None): #if there is an occurence of keyword increment counter 
                    freq = dictionary[keyword][filename]    #here we fail, cannot cast object instance to integer.
                    freq = dictionary[keyword][filename] + 1
                    print(keyword,dictionary[keyword])
                else:
                    dictionary[word][filename] = 1

#extract text between substring 1 and 2 
def extract(text, sub1, sub2): 
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]    

if __name__ == '__main__':
    main()

9条回答
来,给爷笑一个
2楼-- · 2019-02-26 05:34

I think you are trying to add 1 to a dictionary entry that doesn't yet exist. Your getitem method is for some reason returning a new instance of the AutoVivification class when a lookup fails. You're therefore trying to add 1 to a new instance of the class.

I think the answer is to update the getitem method so that it sets the counter to 0 if it doesn't yet exist.

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            self[item] = 0
            return 0

Hope this helps.

查看更多
来,给爷笑一个
3楼-- · 2019-02-26 05:38

One could use Python's collections.defaultdict instead of creating an AutoVivification class and then instantiating dictionary as an object of that type.

import collections
dictionary = collections.defaultdict(lambda: collections.defaultdict(int))

This will create a dictionary of dictionaries with a default value of 0. When you wish to increment an entry, use:

dictionary[keyword][filename] += 1
查看更多
闹够了就滚
4楼-- · 2019-02-26 05:41

In the AutoVivification class, you define

value = self[item] = type(self)()
return value

which returns an instance of self, which is an AutoVivification in that context. The error becomes then clear.

Are you sure you want to return an AutoVivification on any missing key query? From the code, I would assume you want to return a normal dictionary with string key and int values.

By the way, maybe you would be interested in the defaultdict class.

查看更多
Emotional °昔
5楼-- · 2019-02-26 05:44

Not sure why you need nested dicts here. In a typical index scenario you have a forward index mapping

document id -> [word_ids]

and an inverse index mapping

word_id -> [document_ids]

Not sure if this is related here but using two indexes you can perform all kind of queries very efficiently and the implementation is straight forward since you don't need to deal with nested data structures.

查看更多
戒情不戒烟
6楼-- · 2019-02-26 05:47

This AutoVivification class is not the magic you are looking for.

Check out collections.defaultdict from the standard library. Your inner dicts should be defaultdicts that default to integer values, and your outer dicts would then be defaultdicts that default to inner-dict values.

查看更多
我命由我不由天
7楼-- · 2019-02-26 05:49

It would be better to kick AutoVivification out all together, because it adds nothing.

The following line:

if (word == keyword and dictionary[keyword][filename] is not None):

Doesn't work as expected, because of the way your class works, dictionary[keyword] will always return an instance of AutoVivification, and so will dictionary[keyword][filename].

查看更多
登录 后发表回答