After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:
length: 0, {class java.lang.String} size = 40 bytes
length: 7, {class java.lang.String} size = 56 bytes
While the article has some tips to minimize this, I did not find them entirely satisfying. It seems wasteful to use char[] for storing the data. The obvious improvement for most Western languages would be to use byte[] and an encoding like UTF-8 instead, as you would then need only a single byte instead of two to store the most frequent characters.
Of course one could use String.getBytes("UTF-8") and new String(bytes, "UTF-8"); that way even the overhead of the String instance itself would be gone. But then you lose very handy methods like equals(), hashCode(), length(), ...
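For illustration, a minimal sketch of that round trip (the example string is arbitrary; the byte counts in the comments assume plain ASCII content):

    import java.io.UnsupportedEncodingException;

    public class Utf8RoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String s = "hello";
            byte[] utf8 = s.getBytes("UTF-8");        // 5 bytes of payload instead of a 10-byte char[]
            String back = new String(utf8, "UTF-8");  // decode again whenever the String API is needed
            System.out.println(back.equals(s));       // prints true
        }
    }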
Sun has a patent on a byte[] representation of Strings, as far as I can tell:
Frameworks for efficient representation of string objects in Java programming environments
... The techniques can be implemented to create Java string objects as arrays of one-byte characters when it is appropriate ...
But I failed to find an API for that patent.
Why do I care?
In most cases I don't. But I worked on applications with huge caches, containing lots of Strings, which would have benefited from using memory more efficiently.
Does anybody know of such an API? Or is there another way to keep your memory footprint for Strings small, even at the cost of CPU performance or uglier API?
Please don't repeat the suggestions from the above article:
- your own variant of String.intern() (possibly with SoftReferences)
- storing a single char[] and exploiting the current String.substring() implementation to avoid data copying (nasty)
Update
I ran the code from the article on Sun's current JVM (1.6.0_10). It yielded the same results as in 2002.
Remember that there are many types of compression. Using Huffman encoding is a good general-purpose approach, but it is relatively CPU intensive. For a B+Tree implementation I worked on a few years back, we knew that the keys would likely have common leading characters, so we implemented a leading-character compression algorithm for each page in the B+Tree. The code was easy, very, very fast, and resulted in memory usage 1/3 of what we started with. In our case, the real reason for doing this was to save space on disk and to reduce time spent on disk -> RAM transfers (and that 1/3 savings made a huge difference in effective disk performance).
The reason I bring this up is that a custom String implementation wouldn't have helped very much here. We were only able to achieve the gains we did because we worked at the layer of the container that the strings live in.
Trying to optimize a few bytes here and there inside the String object may not be worth it in comparison.
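As a rough, hypothetical sketch of the kind of leading-character (front) compression described above, assuming the keys within a page are kept sorted (the class and method names are made up for illustration):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Toy front compression: each key is stored as (length of prefix shared with the
    // previous key, remaining suffix). Sorted B+Tree page keys often share long prefixes.
    public class FrontCompression {

        static List<Object[]> compress(List<String> sortedKeys) {
            List<Object[]> out = new ArrayList<Object[]>();
            String prev = "";
            for (String key : sortedKeys) {
                int shared = 0;
                int max = Math.min(prev.length(), key.length());
                while (shared < max && prev.charAt(shared) == key.charAt(shared)) {
                    shared++;
                }
                out.add(new Object[] { shared, key.substring(shared) });
                prev = key;
            }
            return out;
        }

        static List<String> decompress(List<Object[]> compressed) {
            List<String> out = new ArrayList<String>();
            String prev = "";
            for (Object[] entry : compressed) {
                int shared = (Integer) entry[0];
                String key = prev.substring(0, shared) + entry[1];
                out.add(key);
                prev = key;
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> keys = Arrays.asList("customer/1001", "customer/1002", "customer/1017");
            // Round-trips back to the original keys while storing mostly short suffixes.
            System.out.println(decompress(compress(keys)));
        }
    }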
There is the overhead of creating an object (at least a dispatch table), the overhead of the fact that it uses 2 bytes per letter, and the overhead of a few extra variables in there that are created to actually improve speed and memory usage in many cases.
If you are going to use OO programming, this is the cost of having clear, usable, maintainable code.
For an answer besides the obvious (which is that if memory usage is that important, you should probably be using C), you could implement your own Strings with an internal representation in BCD byte-arrays.
That actually sounds fun, I might do it just for kicks :)
A Java char array takes 2 bytes per item. A BCD-encoded digit takes 6 bits per letter IIRC, making your strings significantly smaller. There would be a little conversion cost in time, but not too bad really. The really big problem is that you'd have to convert to a String to do anything with it.
You still have the overhead of an object instance to worry about... but that would be better addressed by revamping your design than trying to eliminate instances.
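In that spirit, here is a rough sketch of packing characters from a 64-symbol alphabet into 6 bits each; the class name and alphabet are made up for illustration, and anything outside the alphabet simply cannot be stored:

    // Toy 6-bit packed string: only the 64 characters in ALPHABET can be represented.
    public class SixBitString {
        private static final String ALPHABET =
                " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.";

        private final byte[] packed;
        private final int length;

        public SixBitString(String s) {
            this.length = s.length();
            this.packed = new byte[(length * 6 + 7) / 8];  // 6 bits per character, rounded up
            for (int i = 0; i < length; i++) {
                int code = ALPHABET.indexOf(s.charAt(i));
                if (code < 0) {
                    throw new IllegalArgumentException("not in alphabet: " + s.charAt(i));
                }
                writeCode(i, code);
            }
        }

        public int length() {
            return length;
        }

        public char charAt(int index) {
            return ALPHABET.charAt(readCode(index));
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append(charAt(i));
            }
            return sb.toString();
        }

        private void writeCode(int index, int code) {
            for (int b = 0, bit = index * 6; b < 6; b++, bit++) {
                if ((code & (1 << b)) != 0) {
                    packed[bit / 8] |= (byte) (1 << (bit % 8));
                }
            }
        }

        private int readCode(int index) {
            int code = 0;
            for (int b = 0, bit = index * 6; b < 6; b++, bit++) {
                if ((packed[bit / 8] & (1 << (bit % 8))) != 0) {
                    code |= 1 << b;
                }
            }
            return code;
        }
    }

Even then, each instance still pays the object-header and array-header cost, so the savings only show up when the strings are long or you pack many of them together.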
Finally a note. I'm completely against deploying anything like this unless you have 3 things:
Without all three of those, I'd kick any optimized solution a developer presented to me.
An internal UTF-8 encoding has its advantages (such as the smaller memory footprint that you pointed out), but it has disadvantages too.
For example, determining the character length (rather than the byte length) of a UTF-8 encoded string is an O(n) operation. In a Java String, the cost of determining the character length is O(1), while generating the UTF-8 representation is O(n).
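A small sketch of why that is: counting characters (code points) means skipping UTF-8 continuation bytes, so every byte has to be visited.

    public class Utf8Length {
        // Counting code points in UTF-8-encoded bytes means scanning all of them: O(n).
        static int countCodePoints(byte[] utf8) {
            int count = 0;
            for (byte b : utf8) {
                // Continuation bytes look like 10xxxxxx; every other byte starts a new character.
                if ((b & 0xC0) != 0x80) {
                    count++;
                }
            }
            return count;
        }

        public static void main(String[] args) throws java.io.UnsupportedEncodingException {
            System.out.println(countCodePoints("héllo".getBytes("UTF-8"))); // prints 5 (from 6 bytes)
        }
    }

String.length(), by contrast, simply returns the stored char count.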
It's all about priorities.
Data-structure design can often be seen as a tradeoff between speed and space. In this case, I think the designers of the Java string API made a choice based on these criteria:
- The String class must support all possible Unicode characters.
- Although Unicode defines 1-byte, 2-byte, and 4-byte variants, the 4-byte characters are (in practice) pretty rare, so it's okay to represent them as surrogate pairs. That's why Java uses a 2-byte char primitive (see the example just after this list).
- When people call the length(), indexOf(), and charAt() methods, they're interested in the character position, not the byte position. In order to create fast implementations of these methods, it's necessary to avoid an internal UTF-8 encoding.
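For example, a supplementary character stored as a surrogate pair (the particular character is just an illustration):

    public class SurrogatePairDemo {
        public static void main(String[] args) {
            String clef = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF, outside the 16-bit range
            System.out.println(clef.length());                           // 2 (char units: the surrogate pair)
            System.out.println(clef.codePointCount(0, clef.length()));   // 1 (actual character)
        }
    }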
Languages like C++ make the programmer's life more complicated by defining three different character types and forcing the programmer to choose between them. Most programmers start off using simple ASCII strings, but when they eventually need to support international characters, the process of modifying the code to use multibyte characters is extremely painful. I think the Java designers made an excellent compromise choice by saying that all strings consist of 2-byte characters.
The UseCompressedStrings JVM option (-XX:+UseCompressedStrings) seems like the easiest route to take. If you're using strings only for storage, and not doing any equals/substring/split operations, then something like this CompactCharSequence class could work:
http://www.javamex.com/tutorials/memory/ascii_charsequence.shtml
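If you end up rolling something along those lines yourself, a minimal sketch of the idea, assuming the data really is ASCII-only (the class name here is made up, not the one from the link), could look like this:

    import java.nio.charset.Charset;

    // Stores ASCII text in a byte[] (1 byte per character instead of 2) and exposes it
    // through the CharSequence interface so it can be passed to APIs that accept one.
    public class AsciiCharSequence implements CharSequence {
        private static final Charset ASCII = Charset.forName("US-ASCII");
        private final byte[] data;

        public AsciiCharSequence(String s) {
            this.data = s.getBytes(ASCII);
        }

        private AsciiCharSequence(byte[] data) {
            this.data = data;
        }

        public int length() {
            return data.length;
        }

        public char charAt(int index) {
            return (char) (data[index] & 0x7F);
        }

        public CharSequence subSequence(int start, int end) {
            byte[] sub = new byte[end - start];
            System.arraycopy(data, start, sub, 0, sub.length);
            return new AsciiCharSequence(sub);
        }

        @Override
        public String toString() {
            return new String(data, ASCII);
        }
    }

Note that this gives up String's equals()/hashCode() semantics, which matches the "storage only" caveat above.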
I believe that Strings have been less memory-intensive for some time now, because the Java engineers have implemented the flyweight design pattern to share as much as possible. In fact, I believe Strings that have the same value point to the very same object in memory.
You said not to repeat the article's suggestion of rolling your own interning scheme, but what's wrong with String.intern() itself? The article contains a throwaway remark about it, but even if the memory usage figures from 2002 still hold six years later, I'd be surprised if no progress has been made on how much data JVMs can intern.
This isn't purely a rhetorical question - I'm interested to know if there are good reasons to avoid it. Is it implemented inefficiently for highly-multithreaded use? Does it fill up some special JVM-specific area of the heap? Do you really have hundreds of megabytes of unique strings (so interning would be useless anyway)?
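For what it's worth, a small, contrived example of deduplicating cached values with intern() (the repeated "status" values are purely illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class InternDemo {
        public static void main(String[] args) {
            Map<Integer, String> cache = new HashMap<Integer, String>();
            // Simulate many rows whose string values repeat a lot (status codes, country names, ...).
            for (int i = 0; i < 1000; i++) {
                String value = new String(i % 2 == 0 ? "ACTIVE" : "INACTIVE"); // distinct instances
                cache.put(i, value.intern());                                  // collapse duplicates
            }
            // Interned duplicates are the same object, so only two distinct strings are retained.
            System.out.println(cache.get(0) == cache.get(2));  // prints true
        }
    }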