I am running code that creates large objects, containing multiple user-defined classes, which I must then serialize for later use. From what I can tell, only pickling is versatile enough for my requirements. I've been using cPickle to store them, but the files it generates are approximately 40 GB in size, from code that runs in 500 MB of memory. Speed of serialization isn't an issue, but the size of the output is. Are there any tips or alternate approaches I can use to make the pickles smaller?
You can combine your cPickle `dump` call with a compressed file object: write the pickle through the compression layer, and read it back through the same layer to re-load the zipped pickled object.
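A minimal sketch of this approach, assuming the `gzip` module as the compression layer (the object and filename here are illustrative):

```python
# Pickle straight into a gzip-compressed file and load it back.
try:
    import cPickle as pickle  # Python 2: faster C implementation
except ImportError:
    import pickle             # Python 3: C-accelerated by default
import gzip

obj = {"numbers": list(range(1000)), "label": "example"}

# Dump: the pickle bytes pass through gzip on their way to disk.
with gzip.open("obj.pkl.gz", "wb") as f:
    pickle.dump(obj, f, -1)  # -1 = highest available protocol

# Re-load the zipped pickled object the same way.
with gzip.open("obj.pkl.gz", "rb") as f:
    restored = pickle.load(f)
```

The same pattern works with `bz2.BZ2File` in place of `gzip.open` if you prefer better compression over speed.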
You might want to use a more efficient pickling protocol. In Python 2 there are three pickle protocols:
- protocol 0: the original ASCII protocol, backwards compatible with earlier Python versions
- protocol 1: the old binary format
- protocol 2: introduced in Python 2.3, with more efficient pickling of new-style classes

Furthermore, the default is protocol 0, the least efficient one.
Let's check the difference in size between the latest protocol, which is currently protocol 2 (the most efficient one), and protocol 0 (the default) for an arbitrary example. Note that I use protocol=-1 to make sure we are always using the latest protocol, and that I import cPickle to make sure we are using the faster C implementation:
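A sketch of that comparison; the data here is an arbitrary stand-in, so the exact sizes will differ from the numbers reported below:

```python
# Compare pickle protocol 0 (the old default) against the latest protocol.
try:
    import cPickle as pickle  # faster C implementation on Python 2
except ImportError:
    import pickle
import os

# An arbitrary nested object standing in for the real data.
data = {i: list(range(i % 100)) for i in range(1000)}

with open("old.pkl", "wb") as f:
    pickle.dump(data, f, 0)   # protocol 0: ASCII, least efficient
with open("new.pkl", "wb") as f:
    pickle.dump(data, f, -1)  # -1: latest (most efficient) protocol

old_size = os.path.getsize("old.pkl")
new_size = os.path.getsize("new.pkl")
print("protocol 0: %d bytes, latest: %d bytes, ratio %.1fx"
      % (old_size, new_size, float(old_size) / new_size))
```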
In my run, pickling with the old protocol used 2172 KB, while the new protocol used 782 KB: a factor of roughly 2.8x. Note that this factor is specific to this example; your results might vary, depending on the object you are pickling.
If you must use pickle and no other method of serialization works for you, you can always pipe the pickle through `bzip2`. The only problem is that `bzip2` is a little bit slow; `gzip` should be faster, but the resulting file is almost 2x bigger. In my test, the `bzip2` file was almost 40x smaller than the raw pickle and the `gzip` file 20x smaller, while gzip stayed pretty close in performance to raw cPickle.
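A sketch of writing the same pickle raw, through `bz2`, and through `gzip` for comparison; the data and filenames are illustrative, and the 40x/20x ratios above depend heavily on how compressible your objects are:

```python
# Write one pickle three ways and compare the on-disk sizes.
try:
    import cPickle as pickle
except ImportError:
    import pickle
import bz2
import gzip
import os

# Repetitive data compresses well; real objects will vary.
data = {i: "some repetitive payload " * 5 for i in range(2000)}
raw = pickle.dumps(data, -1)

with open("data.pkl", "wb") as f:
    f.write(raw)                    # raw pickle, no compression
with bz2.BZ2File("data.pkl.bz2", "wb") as f:
    f.write(raw)                    # slower, best compression
with gzip.open("data.pkl.gz", "wb") as f:
    f.write(raw)                    # faster, somewhat larger files

raw_size = os.path.getsize("data.pkl")
bz2_size = os.path.getsize("data.pkl.bz2")
gz_size = os.path.getsize("data.pkl.gz")
print("raw: %d, bz2: %d, gzip: %d" % (raw_size, bz2_size, gz_size))
```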