My application is multithreaded with intensive String processing. We are experiencing excessive memory consumption and profiling has demonstrated that this is due to String data. I think that memory consumption would benefit greatly from using some kind of flyweight pattern implementation or even cache (I know for sure that Strings are often duplicated, although I don't have any hard data in that regard).
I have looked at the Java constant pool and String.intern(), but it seems that they can cause some PermGen problems.
What would be the best alternative for implementing an application-wide, multithreaded pool of Strings in Java?
EDIT: Also see my previous, related question: How does java implement flyweight pattern for string under the hood?
Java 7/8
If you are doing what the accepted answer says and you are using Java 7 or newer, you are not doing what it says you are: the implementation of substring() has changed. Never write code that relies on an implementation detail that can change drastically and might make things worse if you are relying on the old behavior. So if you use the accepted answer with Java 7 or newer, you are creating twice as much memory usage and garbage that needs to be collected.
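To make the change concrete, here is a minimal sketch (the log line and the id format are made up):

```java
String line = "WARN-1234: connection pool exhausted"; // made-up line
String id = line.substring(0, line.indexOf(':'));     // "WARN-1234"
// Java 7u6 and later: substring() copies just those characters into a new
// char[], so 'line' can be collected independently of 'id'.
// Java 6 and earlier: 'id' shared line's entire backing char[], keeping the
// whole line alive for as long as 'id' was reachable.
```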
This is already done at the JVM level. You only need to ensure that you aren't creating new String instances every time, either explicitly or implicitly. I.e. don't do:
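For instance (the literal "foo" is only a placeholder value):

```java
String s1 = new String("foo"); // explicitly creates a new instance in the heap
String s2 = new String("foo"); // and yet another instance with identical contents
```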
This would create two instances in the heap. Rather do so:
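For example, with the same placeholder literal:

```java
String s1 = "foo"; // the literal is taken from the string pool
String s2 = "foo"; // reuses the very same pooled instance
```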
This will create one instance in the heap and both will refer to the same one (as evidence, s1 == s2 will return true here). Also don't use += to concatenate strings (in a loop):
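For instance (the loop bound and the appended values are arbitrary):

```java
String s = "";
for (int i = 0; i < 1000; i++) {
    s += i; // builds a brand-new String every iteration and discards the previous one
}
```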
The += implicitly creates a new String in the heap every time. Rather do so:
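For instance, with the same arbitrary loop as above:

```java
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 1000; i++) {
    sb.append(i); // appends into one growing buffer instead of reallocating a String per iteration
}
String s = sb.toString();
```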
If you can, rather use StringBuilder or its synchronized brother StringBuffer instead of String for "intensive String processing". It offers useful methods for exactly those purposes, such as append(), insert(), delete(), etc. Also see its javadoc.
Efficiently pack Strings in memory! I once wrote a hyper memory-efficient Set class, where Strings were stored as a tree. If a leaf was reached by traversing the letters, the entry was contained in the set. Fast to work with, too, and ideal for storing a large dictionary.
And don't forget that Strings are often the largest part of memory in nearly every app I have profiled, so do care about them if you need a lot of them.
Illustration:
You have 3 Strings: Beer, Beans and Blood. You can create a tree structure like this:
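One way the shared-prefix tree could be laid out (the drawing is only illustrative):

```
B
+- e
|  +- er     (Beer)
|  +- ans    (Beans)
+- lood      (Blood)
```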
This is very efficient for e.g. a list of street names, and it is obviously most reasonable with a fixed dictionary, because inserts cannot be done efficiently. In fact, the structure should be created once, then serialized, and afterwards just loaded.
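A rough sketch of such a letter-tree set (the class and method names are made up for illustration, and a HashMap per node is used for clarity rather than for minimal memory footprint):

```java
import java.util.HashMap;
import java.util.Map;

/** A prefix-tree ("trie") backed set of Strings; build it once, then only query it. */
class StringTreeSet {
    private final Map<Character, StringTreeSet> children = new HashMap<Character, StringTreeSet>();
    private boolean leaf; // true if a stored String ends at this node

    void add(String s) {
        StringTreeSet node = this;
        for (int i = 0; i < s.length(); i++) {
            StringTreeSet child = node.children.get(s.charAt(i));
            if (child == null) {
                child = new StringTreeSet();
                node.children.put(s.charAt(i), child);
            }
            node = child;
        }
        node.leaf = true;
    }

    boolean contains(String s) {
        StringTreeSet node = this;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.leaf;
    }
}
```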
Note: This answer uses examples that might not be relevant in modern JVM runtime libraries. In particular, the substring example is no longer an issue in OpenJDK/Oracle 7+.

I know it goes against what people often tell you, but sometimes explicitly creating new String instances can be a significant way to reduce your memory use. Because Strings are immutable, several methods leverage that fact and share the backing character array to save memory. However, occasionally this can actually increase memory use by preventing garbage collection of the unused parts of those arrays.
For example, assume you were parsing the message IDs of a log file to extract warning IDs. Your code would look something like this:
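Something along these lines (the log format, the pattern, and the class name are invented for the example):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WarningIdExtractor {
    // Invented line format: "2013-04-05 12:00:00 [WARN-1234] something happened"
    private static final Pattern WARN_ID = Pattern.compile("\\[(WARN-\\d+)\\]");

    public static List<String> extract(String logFile) throws IOException {
        List<String> ids = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(logFile));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher matcher = WARN_ID.matcher(line);
                if (matcher.find()) {
                    String id = matcher.group(1); // on old JVMs this keeps the whole line's char[] alive
                    ids.add(id);
                }
            }
        } finally {
            reader.close();
        }
        return ids;
    }
}
```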
But look at the data actually being stored:
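Roughly what you would see for one of those id Strings on a pre-7u6 JVM (the field names are the old String internals; the values are illustrative):

```java
// id.value  -> the char[] holding the ENTIRE original log line
// id.offset -> the index where the captured group starts inside that array
// id.count  -> 9, the length of "WARN-1234" itself
// Every retained id therefore pins the full line's character data in memory.
```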
It's the whole text of the line, because the matcher just wraps a new String instance around the same character data. Compare the results when you replace String id = matcher.group(1); with String id = new String(matcher.group(1));.
First, decide how much your application and developers would suffer if you eliminated some of that parsing. A faster application does you no good if you double your employee turnover rate in the process! I think, based on your question, we can assume you have already passed this test.
Second, if you can't eliminate creating an object, then your next goal should be to ensure it doesn't survive Eden collection, and parse-lookup can solve that problem. However, a cache "implemented properly" (I disagree with that basic premise, but I won't bore you with the attendant rant) usually brings thread contention. You'd be trading one kind of memory pressure for another.
There's a variation of the parse-lookup idiom that suffers less from the sort of collateral damage you usually get from full-on caching, and that's a simple precalculated lookup table (see also "memoization"). The Pattern you usually see for this is the Type Safe Enumeration (TSE). With the TSE, you parse the String, pass it to the TSE to retrieve the associated enumerated type, and then you throw the String away.
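A minimal sketch of that idiom with a plain Java enum (the warning names and codes are invented):

```java
import java.util.HashMap;
import java.util.Map;

public enum WarningType {
    DISK_FULL("WARN-1001"),
    CONNECTION_LOST("WARN-1002"),
    TIMEOUT("WARN-1003");

    private static final Map<String, WarningType> BY_CODE = new HashMap<String, WarningType>();
    static {
        for (WarningType type : values()) {
            BY_CODE.put(type.code, type);
        }
    }

    private final String code;

    WarningType(String code) {
        this.code = code;
    }

    /** Parse once, keep the enum constant, and let the short-lived String be collected young. */
    public static WarningType fromCode(String code) {
        return BY_CODE.get(code); // returns null for unknown codes; handle as appropriate
    }
}
```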
Is the text you're processing free-form, or does the input have to follow a rigid specification? If a lot of your text renders down to a fixed set of possible values, then a TSE could help you here, and serves a greater master: Adding context/semantics to your information at the point of creation, instead of at the point of use.