Why pickle.dump(obj) has different size with sys.g

Posted 2019-08-30 11:30

Question:

I am using the random forest classifier from Python's scikit-learn library for an exercise. The result changes on every run, so I run it 1000 times and average the results.

I save the rf object to a file with pickle.dump() so I can use it for prediction later; each file is about 4 MB. However, sys.getsizeof(rf) reports just 36 bytes.

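import pickle
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=50)
rf.fit(matX, vecY)
with open('var.sav', 'wb') as f:  # pickle.dump() needs a binary file object, not a filename
    pickle.dump(rf, f)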

My questions:

  • sys.getsizeof() seems to report the wrong size for a RandomForestClassifier object, doesn't it? Why?
  • How can I save the object in a zip file so that it takes up less space?

Answer 1:

sys.getsizeof() gives you the memory footprint of just the object itself, not of any other values that object references. To get the total, you'd need to recurse over the object and add the sizes of all its attributes, anything those attributes hold, and so on.
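The recursion described above could be sketched roughly as follows. The helper name total_size and the Holder class are made up for illustration; the sketch only follows dicts, common containers, and instance __dict__ attributes, so it will undercount objects that keep their data in __slots__ or inside C extension types (which is likely the case for parts of a fitted scikit-learn model).

```python
import sys

def total_size(obj, seen=None):
    """Approximate the deep size of obj in bytes by following references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # count shared objects and cycles only once
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)    # shallow size of this object alone
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    if hasattr(obj, "__dict__"):  # instance attributes live in __dict__
        size += total_size(vars(obj), seen)
    return size

class Holder:
    """Hypothetical stand-in for an object that references a lot of data."""
    def __init__(self):
        self.data = list(range(1000))

h = Holder()
shallow = sys.getsizeof(h)   # small: just the instance header
deep = total_size(h)         # includes the list and its integers
print(shallow, deep)
```

The shallow number stays small no matter how much data the instance points to, which is the effect seen with sys.getsizeof(rf) in the question.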

Pickling is a serialization format. Serialization needs to store metadata as well as the contents of the object. Memory size and pickle size only have a rough correlation.

Pickles are byte streams; if you need a more compact byte stream, use compression.
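As a minimal sketch of compressing a pickle, the standard-library gzip module can wrap the byte stream. The model dict here is only a stand-in for the fitted classifier, and how much space is saved depends entirely on how repetitive the data is.

```python
import gzip
import pickle

# Stand-in object; in the question this would be the fitted rf classifier.
model = {"weights": [0.0] * 100_000}

raw = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
packed = gzip.compress(raw)           # compress the pickle byte stream

# Round trip: decompress first, then unpickle.
restored = pickle.loads(gzip.decompress(packed))

print(len(raw), len(packed))
```

To write straight to disk instead, gzip.open(path, 'wb') returns a file object that can be passed to pickle.dump(), and gzip.open(path, 'rb') can be passed to pickle.load().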

If you are storing your pickles in a ZIP file, your data will already be compressed. Compressing the pickle before adding it to the ZIP will not help in that case: already-compressed data contains little redundancy, so a second round of ZIP compression tends to make it slightly bigger because of the added metadata overhead.
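For the ZIP case, here is a rough sketch using the standard-library zipfile module. Note that zipfile.ZipFile defaults to ZIP_STORED (no compression), so ZIP_DEFLATED has to be requested explicitly. The archive is built in memory here only to keep the example self-contained; the member name var.sav is taken from the question and the model dict is a stand-in.

```python
import io
import pickle
import zipfile

model = {"weights": [0.0] * 100_000}   # stand-in for the fitted classifier
raw = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)

buf = io.BytesIO()                     # a real filename works here as well
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("var.sav", raw)        # store the uncompressed pickle; ZIP compresses it

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("var.sav")
    restored = pickle.loads(zf.read("var.sav"))

print(info.file_size, info.compress_size)
```

The pickle goes into the archive uncompressed and the ZIP layer does the compression once, which avoids the double-compression overhead described above.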