To give you context: I have a large file f, several gigabytes in size. It contains consecutive pickles of different objects that were generated by running
for obj in objs: cPickle.dump(obj, f)
I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data were indeed newline delimited, one could use readlines, but I am not sure that it is.
Another option that I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines() and loads(). But I fear that a pickled object may contain the "\n" character and throw off this file reading scheme. Is my fear unfounded?
One option is to pickle everything out as one huge list of objects, but that would require more memory than I can afford. The setup could be sped up by multithreading, but I do not want to go there before I get the buffering working properly. What's the "best practice" for situations like this?
EDIT:
I can also read raw bytes into a buffer and invoke loads on that, but then I need to know how many bytes of the buffer were consumed by loads so that I can throw the head away.
file.readlines() returns a list containing the entire contents of the file, so you'll want to read a few lines at a time instead. I think this naive code should unpickle your data:
import pickle

infile = open('/tmp/pickle', 'rb')
buf = []
while True:
    line = infile.readline()
    if not line:
        break
    buf.append(line)
    # Accumulate lines until one that looks like the end of a pickle.
    if line.endswith('.\n'):
        print 'Decoding', buf
        print pickle.loads(''.join(buf))
        buf = []
If you have any control over the program that generates the pickles, I'd pick one of:
- Use the shelve module.
- Write the length (in bytes) of each pickle to the file just before the pickle itself, so that you know exactly how many bytes to read each time (see the sketch after this list).
- Same as above, but write the list of integers to a separate file so that you can use those values as an index into the file holding the pickles.
- Pickle a list of K objects at a time. Write the length of that pickle in bytes. Write the pickle. Repeat.
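Here is a minimal sketch of the length-prefix idea from the second option above. The filename lenprefix.pkl, the sample data, and the use of struct for a fixed 4-byte header are my own illustrative choices, not part of the question:

import cPickle
import struct

objs = ['apples', 'bananas', {'answer': 42}]  # stand-in for the real objects

# Writing: prefix every pickle with its size as a 4-byte big-endian unsigned int.
with open('lenprefix.pkl', 'wb') as out:
    for obj in objs:
        data = cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL)
        out.write(struct.pack('>I', len(data)))
        out.write(data)

# Reading: the 4-byte header says exactly how many bytes the next pickle occupies.
with open('lenprefix.pkl', 'rb') as infile:
    while True:
        header = infile.read(4)
        if len(header) < 4:
            break
        (size,) = struct.unpack('>I', header)
        print cPickle.loads(infile.read(size))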
By the way, I suspect that the file's built-in buffering should get you 99% of the performance gains you're looking for.
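If you want a bigger buffer than the default, Python 2's built-in open() takes the buffer size as its third argument (a sketch; the 1 MB value here is arbitrary):

# Ask for roughly a 1 MB buffer on the plain file object (size is arbitrary).
infile = open('/tmp/pickle', 'rb', 1024 * 1024)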
If you're convinced that I/O is blocking you, have you thought about trying mmap() and letting the OS handle paging in blocks at a time?
#!/usr/bin/env python
import mmap
import cPickle
fname = '/tmp/pickle'
infile = open(fname, 'rb')
m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
start = 0
while True:
    # find() returns -1 when there is no further '.\n', which makes end == 1.
    end = m.find('.\n', start + 1) + 2
    if end == 1:
        break
    print cPickle.loads(m[start:end])
    start = end
You don't need to do anything, I think.
>>> import pickle
>>> import StringIO
>>> s = StringIO.StringIO(pickle.dumps('apples') + pickle.dumps('bananas'))
>>> pickle.load(s)
'apples'
>>> pickle.load(s)
'bananas'
>>> pickle.load(s)
Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>
pickle.load(s)
File "C:\Python26\lib\pickle.py", line 1370, in load
return Unpickler(file).load()
File "C:\Python26\lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\Python26\lib\pickle.py", line 880, in load_eof
raise EOFError
EOFError
>>>
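Applied to the file from the question, that means you can just call load repeatedly on the same open file object and stop when it raises EOFError. A minimal sketch, assuming the pickles were written with consecutive cPickle.dump() calls (the filename objects.pkl is a placeholder):

import cPickle

objects = []
with open('objects.pkl', 'rb') as f:
    while True:
        try:
            # Each load() consumes exactly one pickle and leaves the file
            # position at the start of the next one.
            objects.append(cPickle.load(f))
        except EOFError:
            break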
You might want to look at the shelve module. It uses a database module such as dbm to create an on-disk dictionary of objects. The objects themselves are still serialized using pickle, so you can read objects individually (or in sets) instead of one big pickle at a time.
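A minimal sketch of that approach; keying the objects by their index, the sample data, and the filename objects.shelf are my own choices for illustration:

import shelve

# Writing: store each object under a string key (shelve keys must be strings).
db = shelve.open('objects.shelf')
for i, obj in enumerate(['apples', 'bananas', {'answer': 42}]):  # example data
    db[str(i)] = obj
db.close()

# Reading: each lookup unpickles just that one object from disk.
db = shelve.open('objects.shelf')
print db['1']
db.close()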
If you want to add buffering to any file, open it via io.open(). Here is an example which will read from the underlying stream in 128K chunks. Each call to cPickle.load() will be fulfilled from the internal buffer until it is exhausted, then another chunk will be read from the underlying file:
import cPickle
import io
buf = io.open('objects.pkl', 'rb', buffering=(128 * 1024))
obj = cPickle.load(buf)