Fastest way to store large files in Python

I recently asked a question regarding how to save large python objects to file. I had previously run into problems converting massive Python dictionaries into string and writing them to file via write(). Now I am using pickle. Although it works, the files are incredibly large (> 5 GB). I have little experience in the field of such large files. I wanted to know if it would be faster, or even possible, to zip this pickle file prior to storing it to memory.

标签： python compression pickle

5条回答

smile是对你的礼貌

2楼-- · 2020-06-03 00:52

You can compress the data with bzip2:

from __future__ import with_statement # Only for Python 2.5
import bz2,json,contextlib

hugeData = {'key': {'x': 1, 'y':2}}
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'wb')) as f:
  json.dump(hugeData, f)

Load it like this:

from __future__ import with_statement # Only for Python 2.5
import bz2,json,contextlib

with contextlib.closing(bz2.BZ2File('data.json.bz2', 'rb')) as f:
  hugeData = json.load(f)

You can also compress the data using zlib or gzip with pretty much the same interface. However, both zlib and gzip's compression rates will be lower than the one achieved with bzip2 (or lzma).

0人赞添加讨论(0) 举报

在下西门庆

3楼-- · 2020-06-03 00:55

I'd just expand on phihag's answer.

When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since it requires additional memory of 1-2 times the size of the object in order to serialize. That's true even when streaming it to BZ2File. In my case I was even running out of swap space.

But the problem with JSON (and similarly with HDF files as mentioned in the linked article) is that it cannot serialize tuples, which in my data are used as keys to dicts. There is no great solution for this; the best I could find was to convert tuples to strings, which requires some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the json library.

For tuples composed of strings (requires strings to contain no commas):

import ujson as json
from bz2 import BZ2File

bigdata = { ('a','b','c') : 25, ('d','e') : 13 }
bigdata = dict([(','.join(k), v) for k, v in bigdata.viewitems()]) 

f = BZ2File('filename.json.bz2',mode='wb')
json.dump(bigdata,f)
f.close()

To re-compose the tuples:

bigdata = dict([(tuple(k.split(',')),v) for k,v in bigdata.viewitems()])

Alternatively if e.g. your keys are 2-tuples of integers:

bigdata2 = { (1,2): 1.2, (2,3): 3.4}
bigdata2 = dict([('%d,%d' % k, v) for k, v in bigdata2.viewitems()])
# ... save, load ...
bigdata2 = dict([(tuple(map(int,k.split(','))),v) for k,v in bigdata2.viewitems()])

Another advantage of this approach over pickle is that json appears to compress a significantly better than pickles when using bzip2 compression.

0人赞添加讨论(0) 举报

我命由我不由天

4楼-- · 2020-06-03 01:04

Python code would be extremely slow when it comes to implementing data serialization. If you try to create an equivalent to Pickle in pure Python, you'll see that it will be super slow. Fortunately the built-in modules which perform that are quite good.

Apart from cPickle, you will find the marshal module, which is a lot faster. But it needs a real file handle (not from a file-like object). You can import marshal as Pickle and see the difference. I don't think you can make a custom serializer which is a lot faster than this...

Here's an actual (not so old) serious benchmark of Python serializers

0人赞添加讨论(0) 举报

smile是对你的礼貌

5楼-- · 2020-06-03 01:05

faster, or even possible, to zip this pickle file prior to [writing]

Of course it's possible, but there's no reason to try to make an explicit zipped copy in memory (it might not fit!) before writing it, when you can automatically cause it to be zipped as it is written, with built-in standard library functionality ;)

See http://docs.python.org/library/gzip.html . Basically, you create a special kind of stream with

gzip.GzipFile("output file name", "wb")

and then use it exactly like an ordinary file created with open(...) (or file(...) for that matter).

0人赞添加讨论(0) 举报

贪生不怕死

6楼-- · 2020-06-03 01:14

Look at Google's ProtoBuffers. Although they are not designed for large files out-of-the box, like audio-video files, they do well with object serialization as in your case, because they were designed for it. Practice shows that some day you may need to update structure of your files, and ProtoBuffers will handle it. Also, they are highly optimized for compression and speed. And you're not tied to Python, Java and C++ are well supported.

0人赞添加讨论(0) 举报

Fastest way to store large files in Python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间