Why is dumping with `pickle` much faster than `jso

Posted 2020-04-16 06:24

Question:

This is for Python 3.6.

Edited and removed a lot of stuff that turned out to be irrelevant.

I had thought json was faster than pickle, and other answers and comments on Stack Overflow suggest that a lot of other people believe this as well.

Is my test kosher? The disparity is much larger than I expected. I get the same results testing on very large objects.

import json
import pickle
import timeit

file_name = 'foo'
num_tests = 100000

obj = {1: 1}

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle: %f seconds" % result)

command = 'json.dumps(obj)'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)

and the output:

pickle: 0.054130 seconds
json:   0.467168 seconds

Answer 1:

I tried several methods based on your code snippet and found that cPickle with the protocol argument set to the highest protocol, cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest way to dump. Note that cPickle exists only in Python 2; in Python 3, pickle uses the C implementation automatically when it is available.

import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np

num_tests = 10

obj = np.random.normal(0.5, 1, [240, 320, 3])

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle:  %f seconds" % result)

command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle:   %f seconds" % result)


command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest:   %f seconds" % result)

command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)


command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack:   %f seconds" % result)

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds
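
Since cPickle exists only on Python 2, a rough Python 3 counterpart of the protocol comparison above might look like the sketch below. It assumes a plain list of floats in place of the NumPy array (a substitution made here to avoid third-party dependencies), and the helper name time_dumps is hypothetical. Timings vary by machine, so none are shown.

```python
import pickle
import random
import timeit

# Stand-in payload: a flat list of floats (assumption; the answer used a NumPy array).
random.seed(0)
payload = [random.gauss(0.5, 1.0) for _ in range(50_000)]


def time_dumps(protocol, number=10):
    """Return total seconds for `number` pickle.dumps calls at the given protocol."""
    return timeit.timeit(lambda: pickle.dumps(payload, protocol=protocol), number=number)


# Protocol 0 is the original text-based format; higher protocols are binary.
text_size = len(pickle.dumps(payload, protocol=0))
binary_size = len(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))

print("protocol 0:       %f seconds, %d bytes" % (time_dumps(0), text_size))
print("highest protocol: %f seconds, %d bytes" % (time_dumps(pickle.HIGHEST_PROTOCOL), binary_size))
```

Printing the byte counts next to the timings helps show where the gap comes from: the binary protocols store each float in a fixed-width form instead of as decimal text.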


Answer 2:

JSON serialises in a human-readable way, whereas pickle serialises to a binary representation. Nevertheless, pickle is often pretty slow. There are variants like cPickle that are faster. If you want even better serialisation, use msgpack.
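
To make the text-versus-binary distinction concrete, here is a minimal sketch using the {1: 1} object from the question. The variable names are illustrative only.

```python
import json
import pickle

obj = {1: 1}

as_json = json.dumps(obj)      # text (str); JSON object keys must be strings
as_pickle = pickle.dumps(obj)  # binary (bytes)

print(type(as_json).__name__, as_json)
print(type(as_pickle).__name__, as_pickle)

# The JSON round trip turns the integer key into a string; pickle preserves it.
print(json.loads(as_json))
print(pickle.loads(as_pickle))
```

The two formats therefore do slightly different work on the same input, which is worth keeping in mind when comparing their timings.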



Answer 3:

How many times did you run the benchmark? In any case, you need to remove the random delays introduced by thread blocking and similar effects. You can do so by running your benchmark a sufficiently high number of times. Also, your input is too small for the serialisation work to outweigh the fixed overhead of the 'boilerplate' code.
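
One way to act on both points is sketched below: repeat the measurement several times and keep the fastest trial, and use a larger input. The helper name best_time, the dictionary size, and the repeat counts are assumptions chosen for illustration, not values from the question.

```python
import json
import pickle
import timeit

# Larger input than {1: 1}, so fixed per-call overhead matters less (size is arbitrary).
obj = {str(i): i for i in range(10_000)}


def best_time(func, repeat=5, number=20):
    """Run `func` `number` times per trial, for `repeat` trials; return the fastest trial."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


pickle_best = best_time(lambda: pickle.dumps(obj))
json_best = best_time(lambda: json.dumps(obj))

print("pickle: %f seconds (best of 5)" % pickle_best)
print("json:   %f seconds (best of 5)" % json_best)
```

Taking the minimum rather than the mean is a common choice here, because slower trials usually reflect interference from other processes rather than the code being measured.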