I recently asked a question regarding how to save large python objects to file. I had previously run into problems converting massive Python dictionaries into string and writing them to file via write()
. Now I am using pickle. Although it works, the files are incredibly large (> 5 GB). I have little experience in the field of such large files. I wanted to know if it would be faster, or even possible, to zip this pickle file prior to storing it to memory.
相关问题
- how to define constructor for Python's new Nam
- streaming md5sum of contents of a large remote tar
- How to get the background from multiple images by
- Evil ctypes hack in python
- Correctly parse PDF paragraphs with Python
You can compress the data with bzip2:
Load it like this:
You can also compress the data using zlib or gzip with pretty much the same interface. However, both zlib and gzip's compression rates will be lower than the one achieved with bzip2 (or lzma).
I'd just expand on phihag's answer.
When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since it requires additional memory of 1-2 times the size of the object in order to serialize. That's true even when streaming it to BZ2File. In my case I was even running out of swap space.
But the problem with JSON (and similarly with HDF files as mentioned in the linked article) is that it cannot serialize tuples, which in my data are used as keys to dicts. There is no great solution for this; the best I could find was to convert tuples to strings, which requires some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the json library.
For tuples composed of strings (requires strings to contain no commas):
To re-compose the tuples:
Alternatively if e.g. your keys are 2-tuples of integers:
Another advantage of this approach over pickle is that json appears to compress a significantly better than pickles when using bzip2 compression.
Python code would be extremely slow when it comes to implementing data serialization. If you try to create an equivalent to Pickle in pure Python, you'll see that it will be super slow. Fortunately the built-in modules which perform that are quite good.
Apart from
cPickle
, you will find themarshal
module, which is a lot faster. But it needs a real file handle (not from a file-like object). You canimport marshal as Pickle
and see the difference. I don't think you can make a custom serializer which is a lot faster than this...Here's an actual (not so old) serious benchmark of Python serializers
Of course it's possible, but there's no reason to try to make an explicit zipped copy in memory (it might not fit!) before writing it, when you can automatically cause it to be zipped as it is written, with built-in standard library functionality ;)
See http://docs.python.org/library/gzip.html . Basically, you create a special kind of stream with
and then use it exactly like an ordinary
file
created withopen(...)
(orfile(...)
for that matter).Look at Google's ProtoBuffers. Although they are not designed for large files out-of-the box, like audio-video files, they do well with object serialization as in your case, because they were designed for it. Practice shows that some day you may need to update structure of your files, and ProtoBuffers will handle it. Also, they are highly optimized for compression and speed. And you're not tied to Python, Java and C++ are well supported.