How to persist Python class with member variables

Posted 2019-08-08 05:23

Question:

The use case: a Python class stores large numpy arrays (large, but small enough that working with them in memory is a breeze) in a useful structure. Here's a cartoon of the situation:

main class: Environment; stores useful information pertinent to all balls

"child" class: Ball; stores information pertinent to this particular ball

Environment member variable: balls_in_environment (list of Balls)

Ball member variable: large_numpy_array (NxN numpy array that is large, but still easy to work with in-memory)

Preferably, I would like to persist Environment as a whole.
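
For concreteness, the layout described above could be sketched roughly as follows. The constructor arguments (`n`, `seed`, `gravity`), the `add_ball` helper, and the random data are illustrative assumptions, not part of the real code.

```python
import numpy as np


class Ball(object):
    """Holds the data that belongs to one particular ball."""

    def __init__(self, n, seed=None):
        # NxN array; random contents stand in for the real data (assumption).
        rng = np.random.RandomState(seed)
        self.large_numpy_array = rng.randn(n, n)


class Environment(object):
    """Holds information shared by all balls, plus the balls themselves."""

    def __init__(self, gravity=9.81):
        # `gravity` is a hypothetical example of shared information.
        self.gravity = gravity
        self.balls_in_environment = []

    def add_ball(self, ball):
        self.balls_in_environment.append(ball)


env = Environment()
for i in range(3):
    env.add_ball(Ball(n=100, seed=i))
```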

Some options:

  • pickle: too slow, and it produces output that takes up a LOT of space on the hard drive

  • database: too much work; I could store the important information in the class in a database (requires me to write functions to take info from the class, and put it into the DB) and later rebuild the class by creating a new instance, and refilling it with data from the DB (requires me to write functions to do the rebuilding)

  • JSON: I am not very familiar with JSON, but Python has a standard-library module for it, and it is the solution recommended by this article. I don't see how JSON would be more compact than pickle, though; more importantly, it doesn't deal nicely with numpy

  • MessagePack: another package recommended by the article mentioned above; however, I have never heard of it, and I don't want to strike out into the unknown for what seems to be a standard problem

  • numpy.save + something else: store the numpy arrays associated with each Ball, using numpy.save functionality, and store the non-numpy stuff separately somehow (tedious)?
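
The last option can be sketched in a few lines. This is only an illustration under assumptions: the non-array state is taken to be JSON-serializable, and the names `save_environment`, `load_environment`, `gravity`, and the `ball_<i>` keys are hypothetical. All arrays go into one `.npz` file via `numpy.savez`, and everything else goes into a JSON file.

```python
import json
import os
import tempfile

import numpy as np


def save_environment(arrays, metadata, directory):
    """Write the arrays to one .npz file and the metadata to a JSON file."""
    # Keys ball_0, ball_1, ... are an arbitrary naming convention.
    named = {"ball_%d" % i: a for i, a in enumerate(arrays)}
    np.savez(os.path.join(directory, "balls.npz"), **named)
    with open(os.path.join(directory, "environment.json"), "w") as f:
        json.dump(metadata, f)


def load_environment(directory):
    """Read back what save_environment wrote."""
    with np.load(os.path.join(directory, "balls.npz")) as data:
        arrays = [data["ball_%d" % i] for i in range(len(data.files))]
    with open(os.path.join(directory, "environment.json")) as f:
        metadata = json.load(f)
    return arrays, metadata


directory = tempfile.mkdtemp()
balls = [np.random.randn(50, 50) for _ in range(3)]
save_environment(balls, {"gravity": 9.81}, directory)
loaded_balls, loaded_meta = load_environment(directory)
```

Rebuilding the Environment and Ball instances from the loaded pieces is still manual work, which is the tedious part.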

What is the best option for my use case?

Answer 1:

As I mentioned in the comments, joblib.dump might be a good option. It uses np.save to efficiently store numpy arrays, and cPickle for everything else:

import numpy as np
import cPickle
import joblib
import os


class SerializationTest(object):
    def __init__(self):
        self.array = np.random.randn(1000, 1000)

st = SerializationTest()
fnames = ['cpickle.pkl', 'numpy_save.npy', 'joblib.pkl']

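# using cPickle (the default protocol 0 is a text format; passing
# protocol=cPickle.HIGHEST_PROTOCOL would give a smaller binary pickle)
with open(fnames[0], 'wb') as f:
    cPickle.dump(st, f)

# using np.save (st is not an array, so numpy wraps it in an
# object array and pickles it)
np.save(fnames[1], st)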

# using joblib.dump (without compression)
joblib.dump(st, fnames[2])

# check file sizes
for fname in fnames:
    print('%15s: %8.2f KB' % (fname, os.stat(fname).st_size / 1E3))
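#     cpickle.pkl: 23695.56 KB
#  numpy_save.npy:  8000.33 KB
#      joblib.pkl:     0.18 KB
# Note: a file this small cannot contain the 8 MB array. Presumably the
# joblib release used here wrote the array to a companion .npy file next
# to joblib.pkl; newer releases store everything in a single file.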
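
The listing above targets Python 2 and only writes files. A rough Python 3 sketch of the joblib part, including the reload step, might look like the following. The temporary directory, the file names, the smaller array, and the choice of `compress=3` are assumptions made to keep the example quick and self-contained.

```python
import os
import tempfile

import joblib
import numpy as np


class SerializationTest(object):
    def __init__(self):
        self.array = np.random.randn(200, 200)


st = SerializationTest()
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "plain.joblib")
packed_path = os.path.join(tmpdir, "packed.joblib")

# Without compression (the default).
joblib.dump(st, plain_path)

# With compression; higher levels trade speed for smaller files.
joblib.dump(st, packed_path, compress=3)

for path in (plain_path, packed_path):
    size_kb = os.path.getsize(path) / 1e3
    print("%15s: %8.2f KB" % (os.path.basename(path), size_kb))

# joblib.load detects the compression and rebuilds the object.
restored = joblib.load(packed_path)
```

Random data compresses poorly, so any size difference here says little about real arrays. As with pickle, the class definition must be importable when the file is loaded.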

One potential downside is that because joblib.dump uses cPickle to serialize Python objects, the resulting files are not portable from Python 2 to 3. For better portability you could look into using HDF5, e.g. here.
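
As one illustration of the HDF5 route (not necessarily what the linked resource describes), the sketch below uses the h5py package, with one dataset per array and scalar settings stored as file attributes. The group name `balls`, the attribute `gravity`, and the helper names `save_hdf5` and `load_hdf5` are assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy as np


def save_hdf5(path, arrays, settings):
    """Store each array as a dataset and scalar settings as attributes."""
    with h5py.File(path, "w") as f:
        group = f.create_group("balls")
        for i, array in enumerate(arrays):
            group.create_dataset("ball_%d" % i, data=array)
        for key, value in settings.items():
            f.attrs[key] = value


def load_hdf5(path):
    """Read the arrays and settings written by save_hdf5."""
    with h5py.File(path, "r") as f:
        group = f["balls"]
        # [()] reads the whole dataset into an in-memory numpy array.
        arrays = [group["ball_%d" % i][()] for i in range(len(group))]
        settings = {key: f.attrs[key] for key in f.attrs}
    return arrays, settings


path = os.path.join(tempfile.mkdtemp(), "environment.h5")
balls = [np.random.randn(50, 50) for _ in range(3)]
save_hdf5(path, balls, {"gravity": 9.81})
loaded_balls, loaded_settings = load_hdf5(path)
```

Because the file holds only arrays and scalars rather than pickled Python objects, it does not depend on the Python version that wrote it. The cost is that the objects have to be rebuilt by hand after loading.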