currently I'm working on image retrieval with Python. The keypoints and descriptors extracted from an image in this example are represented as numpy.array
s. The first one of shape (2000, 5) and the latter of shape (2000, 128). Both containing only values of dtype=numpy.float32
.
So, I was wondering which format to use in order to save my extracted keypoints and descriptors. I.e. I'm always saving 2 files: one for the keypoints and one for the descriptors - this counts as one step in my measurements. I compared pickle
, cPickle
(both with protocol 0 and 2) and NumPy's binary format .pny
and the results are really confusing me:
I always thought cPickle
is supposed to be faster than the pickle
module. But especially the load time with protocol 0 really sticks out in the results.
Does anyone have an explanation for this? Is it because I'm only using numeric data? Seems strange...
PS: In my code I'm basically looping 1000 times (number=1000
) over each technique and average the measured time in the end:
timer = time.time
print 'npy save...'
t0 = timer()
for i in range(number):
numpy.save(npy_kp_path, kp)
numpy.save(npy_descr_path, descr)
t1 = timer()
results['npy']['save'] = t1 - t0
print 'npy load...'
t0 = timer()
for i in range(number):
kp = numpy.load(npy_kp_path)
descr = numpy.load(npy_descr_path)
t1 = timer()
results['npy']['load'] = t1 - t0
print 'pickle protocol 0 save...'
t0 = timer()
for i in range(number):
with open(pkl0_descr_path, 'wb') as f:
pickle.dump(descr, f, protocol=0)
with open(pkl0_kp_path, 'wb') as f:
pickle.dump(kp, f, protocol=0)
t1 = timer()
results['pkl0']['save'] = t1 - t0
print 'pickle protocol 0 load...'
t0 = timer()
for i in range(number):
with open(pkl0_descr_path, 'rb') as f:
descr = pickle.load(f)
with open(pkl0_kp_path, 'rb') as f:
kp = pickle.load(f)
t1 = timer()
results['pkl0']['load'] = t1 - t0
print 'cPickle protocol 0 save...'
t0 = timer()
for i in range(number):
with open(cpkl0_descr_path, 'wb') as f:
cPickle.dump(descr, f, protocol=0)
with open(cpkl0_kp_path, 'wb') as f:
cPickle.dump(kp, f, protocol=0)
t1 = timer()
results['cpkl0']['save'] = t1 - t0
print 'cPickle protocol 0 load...'
t0 = timer()
for i in range(number):
with open(cpkl0_descr_path, 'rb') as f:
descr = cPickle.load(f)
with open(cpkl0_kp_path, 'rb') as f:
kp = cPickle.load(f)
t1 = timer()
results['cpkl0']['load'] = t1 - t0
print 'pickle highest protocol (2) save...'
t0 = timer()
for i in range(number):
with open(pkl2_descr_path, 'wb') as f:
pickle.dump(descr, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pkl2_kp_path, 'wb') as f:
pickle.dump(kp, f, protocol=pickle.HIGHEST_PROTOCOL)
t1 = timer()
results['pkl2']['save'] = t1 - t0
print 'pickle highest protocol (2) load...'
t0 = timer()
for i in range(number):
with open(pkl2_descr_path, 'rb') as f:
descr = pickle.load(f)
with open(pkl2_kp_path, 'rb') as f:
kp = pickle.load(f)
t1 = timer()
results['pkl2']['load'] = t1 - t0
print 'cPickle highest protocol (2) save...'
t0 = timer()
for i in range(number):
with open(cpkl2_descr_path, 'wb') as f:
cPickle.dump(descr, f, protocol=cPickle.HIGHEST_PROTOCOL)
with open(cpkl2_kp_path, 'wb') as f:
cPickle.dump(kp, f, protocol=cPickle.HIGHEST_PROTOCOL)
t1 = timer()
results['cpkl2']['save'] = t1 - t0
print 'cPickle highest protocol (2) load...'
t0 = timer()
for i in range(number):
with open(cpkl2_descr_path, 'rb') as f:
descr = cPickle.load(f)
with open(cpkl2_kp_path, 'rb') as f:
kp = cPickle.load(f)
t1 = timer()
results['cpkl2']['load'] = t1 - t0
The (binary representation of) the numeric data of an
ndarray
is pickled as one long string. It appears thatcPickle
is indeed much slower thanpickle
in unpickling large strings from protocol 0 files. Why? My guess is thatpickle
makes use of well-tuned string algorithms from the standard library andcPickle
has fallen behind.The observation above is from playing with Python 2.7. Python 3.3, which uses a C extension automatically, is faster than either module on Python 2.7, so apparently the issue has been fixed.