How to estimate the serialization size of objects

Posted 2020-02-09 03:21

Question:

To optimize messaging in a cluster, I need to know at runtime roughly how big a message is, so that I can decide whether to process it locally or remotely.

I could only find frameworks that estimate an object's in-memory size using Java instrumentation. I've tested classmexer, whose results didn't come close to the serialized size, and SizeOf from SourceForge.

In a small test case, SizeOf was off by around 10% and 10x faster than serialization. (Transient fields still break the estimate completely, and since e.g. ArrayList's backing array is transient but is still serialized as an array, it's not easy to patch SizeOf. But I could live with that.)

On the other hand, 10x faster with 10% error doesn't seem very good. Any ideas how I could do better?

Update: I also tested ObjectSize (http://sourceforge.net/projects/objectsize-java). Results seem good only for objects that don't use inheritance :(

Answer 1:

The size an object takes in memory at runtime doesn't necessarily have any bearing on its serialized size. The example you've mentioned is transient fields. Other examples include objects that implement Externalizable and handle serialization themselves.

If an object implements Externalizable or provides readObject()/writeObject() then your best bet is to serialize the object to a memory buffer to find out the size. It's not going to be fast, but it will be accurate.
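As a minimal sketch of this measurement: the helper below serializes an object with the standard ObjectOutputStream into a stream that only counts bytes, so no buffer has to be kept when only the size matters. The names SerializedSize and CountingOutputStream are made up for this illustration. The result is the size on a fresh stream, so it includes the stream header and class descriptors.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of any library mentioned in the thread.
final class SerializedSize {

    /** OutputStream that discards everything written to it and only counts the bytes. */
    private static final class CountingOutputStream extends OutputStream {
        private long count;

        @Override
        public void write(int b) {
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            count += len;
        }

        long count() {
            return count;
        }
    }

    /** Returns the number of bytes Java serialization writes for obj on a fresh stream. */
    static long of(Serializable obj) {
        CountingOutputStream counter = new CountingOutputStream();
        // Closing the ObjectOutputStream flushes its internal buffer into the counter.
        try (ObjectOutputStream out = new ObjectOutputStream(counter)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.count();
    }

    public static void main(String[] args) {
        System.out.println("String:     " + of("hello") + " bytes");
        System.out.println("byte[1000]: " + of(new byte[1000]) + " bytes");
    }
}
```

This still pays the full CPU cost of serialization; it only avoids allocating the byte array.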

If an object is using the default serialization, then you could amend SizeOf to take into account transient fields.

After serializing many objects of the same type, you may be able to build up a "serialization profile" for that type that correlates serialized size with the runtime size from SizeOf. This will then allow you to estimate the runtime size quickly (using SizeOf) and correlate it to the serialized size, arriving at a more accurate result than SizeOf alone provides.
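One possible shape for such a profile is sketched below. It assumes the simplest correlation, a single per-class ratio of serialized bytes to runtime bytes; a real profile might need something richer. The class name SerializationProfile and its methods are hypothetical, and the runtime size is passed in as a plain number so that no particular SizeOf API is assumed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-class ratio between serialized size and runtime size.
final class SerializationProfile {

    /** Running totals of the calibration samples seen for one class. */
    private static final class Totals {
        long runtimeBytes;
        long serializedBytes;
    }

    private final Map<Class<?>, Totals> totalsByType = new ConcurrentHashMap<>();

    /** Records one calibration sample: estimated runtime size and measured serialized size. */
    void record(Class<?> type, long runtimeBytes, long serializedBytes) {
        Totals totals = totalsByType.computeIfAbsent(type, k -> new Totals());
        synchronized (totals) {
            totals.runtimeBytes += runtimeBytes;
            totals.serializedBytes += serializedBytes;
        }
    }

    /** Scales a runtime size by the observed ratio; returns -1 if the type has no samples yet. */
    long estimate(Class<?> type, long runtimeBytes) {
        Totals totals = totalsByType.get(type);
        if (totals == null) {
            return -1;
        }
        synchronized (totals) {
            if (totals.runtimeBytes == 0) {
                return -1;
            }
            return Math.round((double) runtimeBytes * totals.serializedBytes / totals.runtimeBytes);
        }
    }
}
```

Calibration would feed record() with the SizeOf result and the real serialized size for a sample of messages; afterwards only the cheap runtime estimate is needed per message.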



Answer 2:

There are many good points in the other answers, but one thing that is missing is that the serialization mechanism may cache certain objects.

For example, suppose you serialize a series of objects A, B, and C, all of the same class, each holding two objects o1 and o2. Let us say that the object overhead is 100 bytes and that the objects look like this:

Object shared = new Object();
Object shared2 = new Object();

A.o1 = new Object();
A.o2 = shared;

B.o1 = shared2;
B.o2 = shared;

C.o1 = shared2;
C.o2 = shared;

For simplicity's sake, say that a generic object takes 50 bytes to serialize, so A's serialized size is 100 (overhead) + 50 (o1) + 50 (o2) = 200 bytes. One could make a similar naive estimate for B and C as well.

However, if all three are serialized by the same object output stream before reset is called, what you will see in the stream is a serialization of A with its o1 and o2, then a serialization of B with its o1, but only a reference to o2, since it is the same object that was already serialized. If an object reference takes 16 bytes, the size of B is now 100 (overhead) + 50 (o1) + 16 (reference to o2) = 166 bytes. So the size it takes to serialize has changed! A similar calculation for C, with both objects cached, gives 100 (overhead) + 16 + 16 = 132 bytes. The serialized size is therefore different for all three objects, with a ~33% difference between the largest and the smallest.

So unless you serialize the entire object without a cache every time, it is difficult to accurately estimate the size required to serialize the object.
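The caching effect can be observed directly with the standard ObjectOutputStream. The sketch below (class name BackReferenceDemo is made up for this illustration) writes the same byte array three times to one stream and reports how many bytes each write added: the first write, a repeated write, and a write after reset(). The byte counts in the example above are illustrative; the real numbers depend on the stream format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class showing back-references within one ObjectOutputStream.
final class BackReferenceDemo {

    /** Returns the bytes added by: the first write, a repeated write, a write after reset(). */
    static int[] measure() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(buffer);
            out.flush(); // push the stream header out so it is not counted below
            byte[] payload = new byte[1000];

            int before = buffer.size();
            out.writeObject(payload); // full serialization of the array
            out.flush();
            int first = buffer.size() - before;

            before = buffer.size();
            out.writeObject(payload); // same instance: only a back-reference is written
            out.flush();
            int repeat = buffer.size() - before;

            out.reset(); // forget all previously written objects
            out.flush(); // push the reset marker out so it is not counted below
            before = buffer.size();
            out.writeObject(payload); // written in full again
            out.flush();
            int afterReset = buffer.size() - before;

            return new int[] {first, repeat, afterReset};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizes = measure();
        System.out.println("first write:       " + sizes[0] + " bytes");
        System.out.println("repeated write:    " + sizes[1] + " bytes");
        System.out.println("write after reset: " + sizes[2] + " bytes");
    }
}
```

The repeated write should be only a handful of bytes, while the writes before and after reset() carry the whole array.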



Answer 3:

Just an idea: you could serialize the object to a byte buffer first, get its length, and then decide whether to send the buffer's content to a remote location or process it locally (if the decision depends on the message size).

Drawback: you waste the time spent on serialization if you later decide not to use the buffer. But if you estimate instead, you waste the estimation effort whenever you do need to serialize (because in that case you estimate first and serialize in a second step).
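A rough sketch of this serialize-then-decide approach follows. It assumes that small messages are worth sending to a remote node and large ones are processed locally; the direction of that rule, the threshold parameter maxRemoteBytes, and the names SizeBasedRouting, Target and Decision are all assumptions made for this illustration, not something stated in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch: serialize once, then route based on the exact size.
final class SizeBasedRouting {

    enum Target { LOCAL, REMOTE }

    /** The routing decision plus the serialized bytes, so a remote send can reuse them. */
    static final class Decision {
        final Target target;
        final byte[] payload;

        Decision(Target target, byte[] payload) {
            this.target = target;
            this.payload = payload;
        }
    }

    /** Serializes the message and picks REMOTE only if it fits within maxRemoteBytes. */
    static Decision decide(Serializable message, int maxRemoteBytes) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] payload = buffer.toByteArray();
        // Assumed policy: small payloads are cheap to ship, large ones stay local.
        Target target = payload.length <= maxRemoteBytes ? Target.REMOTE : Target.LOCAL;
        return new Decision(target, payload);
    }
}
```

When the decision is REMOTE, the payload is sent as-is and no work is wasted; only the LOCAL case pays for a serialization that is then thrown away.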



Answer 4:

There can be no way to estimate the serialized size of an object with both good precision and speed. For example, an object could be a cache of the digits of Pi that reconstructs itself at runtime given only the length you need. It will serialize only the 4 bytes of its 'length' attribute, while the object could be using hundreds of megabytes of memory to store those digits.

The only solution I can think of is adding your own interface with a method int estimateSerializeSize(). For every object implementing this interface, you would call this method to get the proper size. If an object does not implement it, you would have to use SizeOf.
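A minimal sketch of that interface idea is shown below. Only the method name estimateSerializeSize() comes from the answer; the interface name SizeEstimable, the PiDigits example class and the SizeEstimator dispatcher are hypothetical. The fallback is passed in as a function so that the sketch does not assume any particular SizeOf API.

```java
import java.io.Serializable;
import java.util.function.ToLongFunction;

/** Hypothetical opt-in interface for objects that know their own serialized size. */
interface SizeEstimable {
    int estimateSerializeSize();
}

/** Example from the answer: only 'length' is serialized, the digits are rebuilt on demand. */
final class PiDigits implements SizeEstimable, Serializable {
    private static final long serialVersionUID = 1L;

    private final int length;
    private transient byte[] digits; // would be recomputed lazily; never serialized

    PiDigits(int length) {
        this.length = length;
    }

    @Override
    public int estimateSerializeSize() {
        return Integer.BYTES; // just the 4-byte 'length' field, ignoring stream overhead
    }
}

/** Uses the object's own estimate when available, otherwise a generic fallback. */
final class SizeEstimator {
    private final ToLongFunction<Object> fallback; // stand-in for a SizeOf-style estimator

    SizeEstimator(ToLongFunction<Object> fallback) {
        this.fallback = fallback;
    }

    long estimate(Object obj) {
        if (obj instanceof SizeEstimable) {
            return ((SizeEstimable) obj).estimateSerializeSize();
        }
        return fallback.applyAsLong(obj);
    }
}
```

The cost of this design is that every message class has to keep its estimate in sync with its fields by hand.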