I am saving a numpy array which contains repetitive data:
import numpy as np
import gzip
import cPickle as pkl
import h5py

# Build a (99991, 100) array of overlapping windows of a, so the data is highly redundant
a = np.random.randn(100000, 10)
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])

# Pickle into a gzipped file
f_pkl_gz = gzip.open('noise.pkl.gz', 'wb')
pkl.dump(b, f_pkl_gz, protocol=pkl.HIGHEST_PROTOCOL)
f_pkl_gz.close()

# Plain (uncompressed) pickle
f_pkl = open('noise.pkl', 'wb')
pkl.dump(b, f_pkl, protocol=pkl.HIGHEST_PROTOCOL)
f_pkl.close()

# HDF5 with the strongest gzip setting
f_hdf5 = h5py.File('noise.hdf5', 'w')
f_hdf5.create_dataset('b', data=b, compression='gzip', compression_opts=9)
f_hdf5.close()
Now listing the resulting files:
-rw-rw-r--. 1 alex alex 76962165 Oct 7 20:51 noise.hdf5
-rw-rw-r--. 1 alex alex 79992937 Oct 7 20:51 noise.pkl
-rw-rw-r--. 1 alex alex 8330136 Oct 7 20:51 noise.pkl.gz
So HDF5 with the highest compression setting takes roughly as much space as the raw pickle, and almost 10x the size of the gzipped pickle.
Does anyone have an idea why this happens, and what I can do about it?
The answer is to use chunks, as suggested by @tcaswell. HDF5 compresses each chunk separately, and the chunk size that h5py picks by default is small, so an individual chunk does not contain enough redundancy for the compression to benefit from it.
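You can check which chunk shape h5py picked automatically by reading it back from the dataset; a minimal sketch, assuming the noise.hdf5 file written above:

import h5py

# Inspect the automatically chosen chunk shape and filter of the dataset
with h5py.File('noise.hdf5', 'r') as f:
    dset = f['b']
    print(dset.chunks)       # auto-chosen chunk shape
    print(dset.compression)  # 'gzip'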
Here's the code to give an idea:
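(A minimal sketch along these lines, reusing the array b from above and writing it with a few explicit chunk shapes; the shapes and file names are just illustrative.)

import h5py

# gzip is applied to each chunk independently, so larger chunks give the
# compressor more redundant data to work with
for chunks in [(10, 100), (100, 100), (1000, 100), (10000, 100)]:
    fname = 'noise_chunks_%d.hdf5' % chunks[0]
    with h5py.File(fname, 'w') as f:
        f.create_dataset('b', data=b, chunks=chunks,
                         compression='gzip', compression_opts=9)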
And the resulting file sizes show that, as the chunks become smaller, the size of the file increases.