Since Java's default string interning has got a lot of bad press, I am looking for an alternative.
Can you suggest an API which is a good alternative to Java string interning? My application uses Java 6. My requirement is mainly to avoid duplicate strings via interning.
Regarding the bad press:
- String intern is implemented via a native method. And the C implementation uses a fixed size of some 1k entries and scales very poorly for large number of strings.
- Java 6 stores interned strings in Perm gen. And therefore are not GC'd and possibly lead to perm gen errors. I know this is fixed in java 7 but I can't upgrade to java 7.
Why do I need to use intering?
- My application is a server app with heap size of 10-20G for different deployments.
- During profiling we have figured that hundrends of thousands of string are duplicates and we can significantly improve the memory usage by avoiding storing duplicate strings.
- Memory has been a bottleneck for us and therefore we are targetting it rather than doing any premature optimization.
It scales poorly for many thousand Strings.
It will be cleaned up when the perm gen is cleaned up which is not often but it can mean you reach the maximum of this space if you don't increase it.
I suggest you consider using off heap memory. I have 500 GB in off heap memory and about 1 GB in heap in one application. It isn't useful in all cases but worth considering.
For this I have used a simple array of String. This is very light weight and you can control the upper bound of Strings stored easily.
Here is an example of generic interner.
An interest property of this cache is it doesn't matter that its not thread safe.
For extra speed you can use a power of 2 size and a bit mask, but its more complicated and may not work very well depending on how your hashCodes are calculated.