After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:
length: 0, {class java.lang.String} size = 40 bytes
length: 7, {class java.lang.String} size = 56 bytes
While the article has some tips to minimize this, I did not find them entirely satisfying. It seems wasteful to use char[] for storing the data. The obvious improvement for most western languages would be to use byte[] and an encoding like UTF-8 instead, as you then only need a single byte instead of two to store the most frequent characters.
Of course one could use String.getBytes("UTF-8") and new String(bytes, "UTF-8"). Even the overhead of the String instance itself would be gone. But then you lose very handy methods like equals(), hashCode(), length(), ...
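To illustrate what I mean, here is a minimal sketch of a wrapper that stores the UTF-8 bytes but keeps those handy methods. The class name CompactString is hypothetical, and length() decodes on demand, so it trades CPU for memory:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

/** Hypothetical wrapper: stores UTF-8 bytes but keeps the handy String methods. */
final class CompactString {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private final byte[] utf8;

    CompactString(String s) {
        this.utf8 = s.getBytes(UTF8);
    }

    /** Length in characters, not bytes (decodes on demand, costing CPU). */
    int length() {
        return toString().length();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CompactString
            && Arrays.equals(utf8, ((CompactString) o).utf8);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(utf8);
    }

    @Override
    public String toString() {
        return new String(utf8, UTF8);
    }
}
```

But this only papers over the problem; what I am really after is an API that does this at the java.lang.String level.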
Sun has a patent on a byte[] representation of Strings, as far as I can tell:
Frameworks for efficient representation of string objects in Java programming environments
... The techniques can be implemented to create Java string objects as arrays of one-byte characters when it is appropriate ...
But I failed to find an API for that patent.
Why do I care?
In most cases I don't. But I worked on applications with huge caches, containing lots of Strings, which would have benefitted from using the memory more efficiently.
Does anybody know of such an API? Or is there another way to keep your memory footprint for Strings small, even at the cost of CPU performance or uglier API?
Please don't repeat the suggestions from the above article:
- own variant of String.intern() (possibly with SoftReferences)
- storing a single char[] and exploiting the current String.substring(.) implementation to avoid data copying (nasty)
Update
I ran the code from the article on Sun's current JVM (1.6.0_10). It yielded the same results as in 2002.
Today (2010), each GB you add to a server costs about £80 or $120. Before you go re-engineering the String, you should ask yourself whether it is really worth it.
If you are going to save a GB of memory, perhaps. Ten GB, definitely. If you want to save tens of MB, you are likely to spend more time than it's worth.
How you compact the Strings really depends on your usage pattern. Are there lots of repeated strings? (use an object pool) Are there lots of long strings? (use compression/encoding)
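For the repeated-strings case, a pool can be as simple as the sketch below. This is my own illustration, not a library API; unlike String.intern(), a Map-backed pool lives on the ordinary heap and can be discarded with the cache that uses it:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal string pool: repeated strings share one canonical instance. */
final class StringPool {
    private final Map<String, String> pool = new HashMap<String, String>();

    /** Returns the canonical instance equal to s, registering it on first sight. */
    String canonical(String s) {
        String existing = pool.get(s);
        if (existing != null) {
            return existing;
        }
        pool.put(s, s);
        return s;
    }
}
```

If the cache holds, say, a million entries drawn from a few thousand distinct values, deduplicating them this way saves far more than any per-String byte shaving.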
Another reason you might want smaller strings is to reduce cache usage. Even the largest CPUs have about 8-12 MB of cache. This can be a more precious resource, and one that is not easily increased. In this case I suggest you look at alternatives to strings, but you must keep in mind how much difference it will make in £ or $ against the time it takes.
Out of curiosity, are the few bytes saved really worth it?
Normally, I suggest ditching Strings for performance reasons in favor of StringBuffer (remember, Strings are immutable).
Are you seriously exhausting your heap from string references?
At Terracotta, we have some cases where we compress big Strings as they are sent around the network and actually leave them compressed until decompression is necessary. We do this by converting the char[] to byte[], compressing the byte[], then encoding that byte[] back into the original char[]. For certain operations like hash and length, we can answer those questions without decoding the compressed string. For data like big XML strings, you can get substantial compression this way.
Moving the compressed data around the network is a definite win. Keeping it compressed is dependent on the use case. Of course, we have some knobs to turn this off and change the length at which compression turns on, etc.
This is all done with byte code instrumentation on java.lang.String which we've found is very delicate due to how early String is used in startup but is stable if you follow some guidelines.
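The compression round-trip described in this answer can be sketched with java.util.zip (this is my own illustration, not Terracotta's actual instrumented implementation, and it stores the compressed form as a plain byte[] rather than re-encoding it into a char[]):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Sketch: String -> UTF-8 bytes -> deflate; inflate -> UTF-8 bytes -> String. */
final class StringCompressor {
    static byte[] compress(String s) throws Exception {
        byte[] input = s.getBytes("UTF-8");
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static String decompress(byte[] compressed) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return new String(out.toByteArray(), "UTF-8");
    }
}
```

As noted above, this only pays off for large, repetitive data such as big XML strings; for short strings the Deflater overhead can make the "compressed" form larger than the original.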