msgpack in Pandas is supposed to be a replacement for pickle.
Per the Pandas docs on msgpack:
This is a lightweight portable binary format, similar to binary JSON, that is highly space efficient, and provides good performance both on the writing (serialization), and reading (deserialization).
I find, however, that its performance does not appear to stack up against pickle.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10000, 100))
>>> %timeit df.to_pickle('test.p')
10 loops, best of 3: 22.4 ms per loop
>>> %timeit df.to_msgpack('test.msg')
10 loops, best of 3: 36.4 ms per loop
>>> %timeit pd.read_pickle('test.p')
100 loops, best of 3: 10.5 ms per loop
>>> %timeit pd.read_msgpack('test.msg')
10 loops, best of 3: 24.6 ms per loop
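The timings above rely on IPython's %timeit. As a rough sketch of how a comparable measurement could be made in a plain script, the snippet below uses the standard library's timeit module and also varies pickle's protocol argument, which can change both size and speed noticeably. It is only an illustration under stated assumptions: a stdlib array of doubles stands in for the DataFrame, and the `roundtrip` helper and `results` dict are hypothetical names, not part of Pandas.

```python
import pickle
import random
import timeit
from array import array

# Stand-in for numeric data: 20,000 doubles in a stdlib array
# (an assumption; the question uses a pandas DataFrame instead).
random.seed(0)
data = array("d", (random.gauss(0.0, 1.0) for _ in range(20_000)))


def roundtrip(protocol):
    """Serialize and deserialize `data` with the given pickle protocol."""
    return pickle.loads(pickle.dumps(data, protocol=protocol))


results = {}
for protocol in (0, 2, pickle.HIGHEST_PROTOCOL):
    size = len(pickle.dumps(data, protocol=protocol))
    # Best of 3 runs, 5 round trips per run, reported per round trip.
    runs = timeit.repeat(lambda: roundtrip(protocol), number=5, repeat=3)
    seconds = min(runs) / 5
    results[protocol] = (size, seconds)
    print(f"protocol={protocol}: {size} bytes, {seconds * 1e3:.2f} ms per round trip")
```

Absolute numbers will vary by machine and Python version, so only the relative differences between protocols are worth reading from the output.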
Question: Aside from potential security issues with pickle, what are the benefits of msgpack over pickle? Is pickle still the preferred method of serializing data, or do better alternatives currently exist?
Pickle is better for the following:
protocol=
)cloudpickle
MsgPack is better for the following:
As @Jeff noted above, this blog post may be of interest.