After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:
length: 0, {class java.lang.String} size = 40 bytes
length: 7, {class java.lang.String} size = 56 bytes
While the article has some tips to minimize this, I did not find them entirely satisfying. It seems wasteful to use char[] for storing the data. The obvious improvement for most western languages would be to use byte[] and an encoding like UTF-8 instead, as you then only need a single byte for the most frequent characters instead of two bytes.
Of course one could use String.getBytes("UTF-8") and new String(bytes, "UTF-8"). Then even the overhead of the String instance itself would be gone. But you would lose very handy methods like equals(), hashCode(), length(), ...
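Just to illustrate what I mean (a rough sketch; Utf8String is a made-up name, not an existing API):

```java
import java.nio.charset.Charset;

// Sketch of the idea (hypothetical class): store the UTF-8 bytes internally,
// convert back to a String only when the text is actually needed.
final class Utf8String {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private final byte[] bytes;

    Utf8String(String s) {
        this.bytes = s.getBytes(UTF8);
    }

    @Override
    public String toString() {
        return new String(bytes, UTF8);
    }

    // equals(), hashCode(), length(), charAt(), ... would all have to be
    // re-implemented by hand, which is exactly the inconvenience described above.
}
```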
Sun has a patent on byte[] representation of Strings, as far as I can tell:
Frameworks for efficient representation of string objects in Java programming environments
... The techniques can be implemented to create Java string objects as arrays of one-byte characters when it is appropriate ...
But I failed to find an API for that patent.
Why do I care?
In most cases I don't. But I worked on applications with huge caches, containing lots of Strings, which would have benefitted from using the memory more efficiently.
Does anybody know of such an API? Or is there another way to keep your memory footprint for Strings small, even at the cost of CPU performance or uglier API?
Please don't repeat the suggestions from the above article:
- your own variant of String.intern() (possibly with SoftReferences)
- storing a single char[] and exploiting the current String.substring(..) implementation to avoid data copying (nasty)
Update
I ran the code from the article on Sun's current JVM (1.6.0_10). It yielded the same results as in 2002.
With a Little Bit of Help From the JVM...
WARNING: This solution is now obsolete in newer Java SE versions. See other ad-hoc solutions further below.
If you use a HotSpot JVM, since Java 6 update 21, you can use this command-line option:

-XX:+UseCompressedStrings

The JVM Options page describes it as using a byte[] for Strings which can be represented as pure ASCII.
UPDATE: This feature was broken in a later version and was supposed to be fixed again in Java SE 6u25, as mentioned in the 6u25 b03 release notes (however, we don't see it in the 6u25 final release notes). The bug report 7016213 is not visible for security reasons. So, use with care and check first. Like any -XX option, it is deemed experimental and subject to change without much notice, so it's probably best not to use it in the startup script of a production server.

UPDATE 2013-03 (thanks to a comment by Aleksey Maximus): See this related question and its accepted answer. The option now seems to be deceased. This is further confirmed in the bug 7129417 report.
The End Justifies the Means
Warning: (Ugly) Solutions for Specific Needs
This is a bit out of the box and lower-level, but since you asked... don't shoot the messenger!
Your Own Lighter String Representation
If ASCII is fine for your needs, then why don't you just roll your own implementation?
As you mentioned, you could use byte[] instead of char[] internally. But that's not all.

To make it even more lightweight, instead of wrapping your byte arrays in a class, why not simply use a helper class containing mostly static methods operating on these byte arrays that you pass around? Sure, it's going to feel pretty C-ish, but it would work, and it would save you the huge overhead that goes with String objects.

And sure, it would miss some nice functionality... unless you re-implement it. If you really need it, then there's not much choice. Thanks to OpenJDK and a lot of other good projects, you could very well roll out your own fugly LiteStrings class that just operates on byte[] parameters. You'll feel like taking a shower every time you need to call a function, but you'll have saved heaps of memory.

I'd recommend making it closely resemble the String class's contract and providing meaningful adapters and builders to convert from and to String; you might also want adapters to and from StringBuffer and StringBuilder, as well as mirror implementations of other things you might need.
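Just to give the flavor of it, a minimal sketch (LiteStrings here is a made-up name, and it assumes ASCII-only content):

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Sketch (hypothetical class, ASCII-only assumption): static helpers operating
// directly on byte[] "strings" that you pass around yourself.
public final class LiteStrings {
    private static final Charset ASCII = Charset.forName("US-ASCII");

    private LiteStrings() {}

    public static byte[] fromString(String s)    { return s.getBytes(ASCII); }
    public static String toString(byte[] s)      { return new String(s, ASCII); }
    public static int    length(byte[] s)        { return s.length; }
    public static char   charAt(byte[] s, int i) { return (char) (s[i] & 0xFF); }

    public static boolean equals(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static int hashCode(byte[] s) {
        // mirrors String.hashCode() for ASCII content, so values stay comparable
        int h = 0;
        for (int i = 0; i < s.length; i++) {
            h = 31 * h + (s[i] & 0xFF);
        }
        return h;
    }

    public static int indexOf(byte[] s, byte[] target) {
        outer:
        for (int i = 0; i + target.length <= s.length; i++) {
            for (int j = 0; j < target.length; j++) {
                if (s[i + j] != target[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```

Call sites then pass byte[] around and go through fromString/toString only at the boundaries of your code.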
Definitely some piece of work, but it might be worth it (see the "Make it Count!" section a bit below).

On-the-Fly Compression/Decompression
You could very well compress your strings in memory and decompress them on the fly when you need them. After all, you only need to be able to read them when you access them, right?
Of course, being that violent will mean paying for the saved memory in CPU time and added complexity.
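A minimal sketch of that idea using java.util.zip (CompressedText is a made-up name):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch (hypothetical class): keep only the deflated bytes in memory and
// pay the CPU cost of inflating every time the text is actually read.
final class CompressedText {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private final byte[] compressed;

    CompressedText(String text) {
        byte[] raw = text.getBytes(UTF8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length / 2 + 16);
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        this.compressed = out.toByteArray();
    }

    @Override
    public String toString() { // decompress on demand
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 3);
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                out.write(buffer, 0, inflater.inflate(buffer));
            }
            inflater.end();
            return new String(out.toByteArray(), UTF8);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupted compressed text", e);
        }
    }
}
```

Note that deflating very short strings can actually make them bigger; this only pays off for longer texts.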
Do Both
For a full headache, of course you can do all of that: your own lighter byte[]-based representation, kept compressed in memory.
Be sure to make that open-source. :)
Make it Count!
By the way, see this great presentation on Building Memory-Efficient Java Applications by N. Mitchell and G. Sevitsky: [2008 version], [2009 version].
From this presentation, we see that an 8-char string eats 64 bytes on a 32-bit system (96 on a 64-bit system!!), and most of it is due to JVM overhead. And from this article we see that an 8-byte array would eat "only" 24 bytes: 12 bytes of header, 8 x 1 byte, plus 4 bytes of alignment.
Sounds like this could be worth it if you really manipulate a lot of that stuff (and possibly speed up things a bit, as you'd spend less time allocating memory, but don't quote me on that and benchmark it; plus it would depend greatly on your implementation).
I think you should be very cautious about basing any ideas and/or assumptions off of a javaworld.com article from 2002. There have been many, many changes to the compiler and JVM in the six years since then. At the very least, test your hypothesis and solution against a modern JVM first to make sure that the solution is even worth the effort.
The article points out two things:
The overhead is due to including a char[] object reference, and three ints: an offset, a length, and space for storing the String's hashcode, plus the standard overhead of simply being an object.
A slightly different approach from String.intern(), or from the shared character array used by String.substring(), is to use a single char[] for all Strings. This means you do not need to store the object reference in your wrapper String-like object. You would still need the offset, and you introduce a (large) limit on how many characters you can have in total.
You would no longer need the length if you use a special end of string marker. That saves four bytes for the length, but costs you two bytes for the marker, plus the additional time, complexity, and buffer overrun risks.
The space-time trade-off of not storing the hash may help you if you do not need it often.
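To make those ideas concrete, a hypothetical sketch (names made up): all string data lives in one shared char[], each "string" is just an int offset into it, and a (char) 0 marker stands in for the stored length.

```java
// Hypothetical sketch: one shared char[] holds all string data; a "string"
// is just an int offset into it, terminated by a '\0' marker, so there is no
// per-string length field and no per-string object header.
final class SharedStringPool {
    private char[] data = new char[1 << 20]; // grows as needed; also caps the total size
    private int used = 0;

    /** Copies s into the pool and returns its offset (the "string handle"). */
    int add(String s) {
        int needed = s.length() + 1; // +1 for the '\0' terminator
        if (used + needed > data.length) {
            char[] bigger = new char[Math.max(data.length * 2, used + needed)];
            System.arraycopy(data, 0, bigger, 0, used);
            data = bigger;
        }
        int offset = used;
        s.getChars(0, s.length(), data, used);
        used += s.length();
        data[used++] = '\0';
        return offset;
    }

    /** Length is recomputed by scanning for the marker - the space/time trade-off. */
    int length(int offset) {
        int i = offset;
        while (data[i] != '\0') i++;
        return i - offset;
    }

    String get(int offset) {
        return new String(data, offset, length(offset));
    }
}
```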
For an application that I've worked with, where I needed super fast and memory efficient treatment of a large number of strings, I was able to leave the data in its encoded form, and work with byte arrays. My output encoding was the same as my input encoding, and I didn't need to decode bytes to characters nor encode back to bytes again for output.
In addition, I could leave the input data in the byte array it was originally read into - a memory mapped file.
My objects consisted of an int offset (the limit suited my situation), an int length, and an int hashcode.
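Roughly, such an object could look like the following (a simplified sketch, not my actual code; in the real thing the backing byte[] would be a single shared reference to the memory-mapped input data, so each object really is just the three ints):

```java
// Simplified sketch: a "string" as a view into one shared byte[] (e.g. the
// memory-mapped input data), plus a cached hash code.
final class ByteSlice {
    private final byte[] data;  // shared buffer, never copied
    private final int offset;
    private final int length;
    private final int hash;

    ByteSlice(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
        int h = 0;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + data[i];
        }
        this.hash = h;
    }

    public int length() { return length; }

    @Override
    public int hashCode() { return hash; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ByteSlice)) return false;
        ByteSlice other = (ByteSlice) o;
        if (other.length != length || other.hash != hash) return false;
        for (int i = 0; i < length; i++) {
            if (data[offset + i] != other.data[other.offset + i]) return false;
        }
        return true;
    }
}
```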
java.lang.String was the familiar hammer for what I wanted to do, but not the best tool for the job.
I'm currently implementing a compression method as follows (I'm working on an app that needs to store a very large number of documents in memory so we can do document-to-document computation):

- Split the string into 4-character "words" and pack each one into a long using masking/bit shifting. If you don't need the full Unicode set and just the 255 ASCII characters, you can fit 8 characters into each long. Add (char) 0 to the end of the string until the length divides evenly by 4 (or 8).
- Use a hash set implementation for longs (such as Trove's TLongHashSet) and add each "word" to that set, compiling an array of the internal indexes of where the long ends up in the set (make sure you also update your index when the set gets rehashed).
- Use a two-dimensional int array to store these indexes (so the first dimension is each compressed string, and the second dimension is each "word" index in the hash set), and return the single int index into that array back to the caller (you have to own the word arrays so you can globally update the index on a rehash, as mentioned above).
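The packing in the first step looks roughly like this (just a sketch of the 4-chars-per-long variant; the class and method names are made up):

```java
// Sketch of step 1: pack every 4 chars (16 bits each) into one long via bit shifting.
final class WordPacker {
    /** Pads with (char) 0 until the length divides evenly by 4, then packs. */
    static long[] pack(String s) {
        int padded = ((s.length() + 3) / 4) * 4;
        long[] words = new long[padded / 4];
        for (int i = 0; i < padded; i++) {
            char c = i < s.length() ? s.charAt(i) : (char) 0; // padding
            words[i / 4] |= ((long) c) << (16 * (i % 4));
        }
        return words;
    }

    /** Reverses the packing for one word (4 chars). */
    static void unpack(long word, char[] out, int at) {
        for (int j = 0; j < 4; j++) {
            out[at + j] = (char) ((word >>> (16 * j)) & 0xFFFF);
        }
    }
}
```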
Advantages:

- A string of length n is stored as an int array of length n/4, with the additional overhead of the long word set, which grows asymptotically as fewer unique "words" are encountered.
- The caller is handed back a single int string "ID", which is convenient and small to store in their objects.

Disadvantages:
Just compress them all with gzip. :) Just kidding... but I have seen stranger things, and it would give you much smaller data at significant CPU expense.
The only other String implementations that I'm aware of are the ones in the Javolution classes. I don't think that they are more memory efficient, though:
http://www.javolution.com/api/javolution/text/Text.html
http://www.javolution.com/api/javolution/text/TextBuilder.html
Java chose UTF-16 as a compromise between speed and storage size. Processing UTF-8 data is much more of a PITA than processing UTF-16 data (e.g. when trying to find the position of character X in the byte array, how are you going to do that quickly if every character can take one, two, three, or even four bytes? Ever thought about that? Going over the string byte by byte is not really fast, you see?). Of course UTF-32 would be easiest to process, but it wastes twice the storage space. Things have changed since the early Unicode days. Now certain characters need 4 bytes even when UTF-16 is used. Handling these correctly makes UTF-16 almost as bad as UTF-8.
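To make that concrete, locating the n-th code point in a UTF-8 byte[] means something like this (sketch only, no validation), whereas with UTF-16 it is, surrogates aside, just an index multiplication:

```java
// Sketch: finding the byte offset of the n-th code point in UTF-8 requires
// walking the array, because each code point occupies 1 to 4 bytes.
static int byteOffsetOfCodePoint(byte[] utf8, int n) {
    int index = 0;
    for (int seen = 0; seen < n; seen++) {
        int lead = utf8[index] & 0xFF;
        if      (lead < 0x80) index += 1; // ASCII
        else if (lead < 0xE0) index += 2; // 2-byte sequence
        else if (lead < 0xF0) index += 3; // 3-byte sequence
        else                  index += 4; // 4-byte sequence
    }
    return index; // with UTF-16 (ignoring surrogate pairs) this is simply 2 * n bytes
}
```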
Anyway, rest assured that if you implement a String class with internal storage that uses UTF-8, you might win some memory, but you will lose processing speed for many string methods. Also, your argument is far too limited a point of view: it will not hold true for someone in Japan, since Japanese characters will not be smaller in UTF-8 than in UTF-16 (they actually take 3 bytes in UTF-8, while they are only two bytes in UTF-16). I don't understand why programmers in such a global world as today's, with the omnipresent Internet, still talk about "western languages" as if that is all that counts, as if only the western world has computers and the rest of it lives in caves. Sooner or later any application gets bitten by the fact that it fails to effectively process non-western characters.