How to save a dictionary with pickle

Posted 2019-09-06 00:38

Question:

I'm trying to use pickle to save a dictionary to a file. The code that saves the dictionary runs without any problems, but when I try to retrieve the dictionary from the file in the Python shell, I get an EOFError:

>>> import pprint
>>> import pickle
>>> pkl_file = open('data.pkl', 'rb')
>>> data1 = pickle.load(pkl_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError

My code counts the frequency of each word and the date of the data (the date comes from the file name), then saves each word as a dictionary key whose value is a list of (date, freq) tuples. Now I want to use this dictionary as the input to another part of my work.
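For example, an entry in the resulting dictionary would look something like this (the dates and counts here are made up):

wordDict = {
    u'algorithm': [('2012-01-02', 3), ('2012-01-07', 1)],
    u'data': [('2012-01-02', 5)],
}

My code is below: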

import os
import codecs
import pickle

# Dictionary shared by parsing() and the pickling step below.
wordDict = {}

def pathFilesList():
    source = 'StemmedDataset'
    retList = []
    for r, d, f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList

def parsing():
    fileList = pathFilesList()
    for f in fileList:
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        fw = codecs.open(f, 'r', encoding='utf-8')
        fLines = fw.readlines()
        for line in fLines:
            sWord = line.strip()
            fileWordList.append(sWord)
            if sWord not in fileWordSet:
                fileWordSet.add(sWord)
        for stemWord in fileWordSet:
            stemFreq = fileWordList.count(stemWord)
            if stemWord not in wordDict:
                wordDict[stemWord] = [(f[15:-4], stemFreq)]
            else:
                wordDict[stemWord].append((f[15:-4], stemFreq))
        fw.close()

if __name__ == "__main__":
    parsing()
    output = open('data.pkl', 'wb')
    pickle.dump(wordDict, output)
    output.close()

What do you think the problem is?

Answer 1:

Since this is Python 2, you often have to be more explicit about what encoding your source code is written in. PEP 263 explains this in detail. My suggestion is that you try adding the following as the very first two lines of unpickle.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The rest of your code....

By the way, if you are going to work a lot with non-ASCII characters, it might be a good idea to use Python 3 instead.
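As a minimal sketch, an unpickle.py along those lines could look like this (the with-block just makes sure the file is always closed, and pprint is only there to inspect the result):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pprint
import pickle

# Open the pickle in binary mode and load the whole dictionary back.
with open('data.pkl', 'rb') as pkl_file:
    data1 = pickle.load(pkl_file)

pprint.pprint(data1)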



Answer 2:

# Added some code and comments to make the code more complete.
# Using collections.Counter to count words.

import os
import codecs
import pickle
from collections import Counter

wordDict = {}

def pathFilesList():
    source='StemmedDataset'
    retList = []
    for r, d, f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList

# Parse the corpus: count the frequency of each word and record the
# date of the data (the date comes from the file name), then store
# each word as a dictionary key whose value is a list of (date, freq)
# tuples.
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        # One word per line, strip space. No empty lines.
        fw = codecs.open(f, mode='r', encoding='utf-8')
        fileWords = Counter(w for w in fw.read().split())
        # For each unique word, count occurrences and store them in the dict.
        for stemWord, stemFreq in fileWords.items():
            if stemWord not in wordDict:
                wordDict[stemWord] = [(date_stamp, stemFreq)]
            else:
                wordDict[stemWord].append((date_stamp, stemFreq))
        # Close file and do next.
        fw.close()


if __name__ == "__main__":
    # Parse all files and store in wordDict.
    parsing()

    output = open('data.pkl', 'wb')

    # Assume wordDict is global.
    print "Dumping wordDict of size {0}".format(len(wordDict))
    pickle.dump(wordDict, output)

    output.close()
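
Once data.pkl has been written by the script above, a minimal sketch for reading it back in a separate script or session would be:

import pprint
import pickle

# Load the pickled dictionary back from disk.
with open('data.pkl', 'rb') as pkl_file:
    wordDict = pickle.load(pkl_file)

print "Loaded wordDict of size {0}".format(len(wordDict))
pprint.pprint(wordDict)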


Answer 3:

If you are looking for something that saves large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), then you might want to look at klepto.

klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, and have each entry be a file. klepto also offers caching algorithms, so if you are using a filesystem backend for the dictionary, you can avoid some speed penalty by utilizing memory caching.

>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True) 
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo          
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # the archive is set to cache to memory, so use 'dump' to write it to the filesystem
>>> demo.dump()
>>> del demo
>>> 
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>> 

klepto also has other flags, such as compression and memmode, that can be used to customize how your data is stored (e.g. compression level, memory-map mode, etc.). It's equally easy (the exact same interface) to use a database (MySQL, etc.) as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.
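
For instance, a sketch of reopening the same archive without the memory cache (assuming the dir_archive constructor shown above also accepts cached as a keyword, as the reprs suggest):

>>> from klepto.archives import dir_archive
>>> # no memory cache: lookups and assignments go straight to the files on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a']
1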

klepto also lets you customize your encoding by building a custom keymap.

>>> from klepto.keymaps import *
>>> 
>>> s = stringmap(encoding='hex_codec')
>>> x = [1,2,'3',min]
>>> s(x)
'285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c29'
>>> p = picklemap(serializer='dill')
>>> p(x)
'\x80\x02]q\x00(K\x01K\x02U\x013q\x01c__builtin__\nmin\nq\x02e\x85q\x03.'
>>> sp = s+p
>>> sp(x)
'\x80\x02UT28285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c292c29q\x00.' 

Get klepto here: https://github.com/uqfoundation



Tags: python pickle