List<String> list = new ArrayList<>();
for (int i = 0; i < 1000; i++)
{
StringBuilder sb = new StringBuilder();
String string = sb.toString();
string = string.intern()
list.add(string);
}
In the above sample, after invoking string.intern() method, when will the 1000 objects created in heap (sb.toString) be cleared?
Edit 1:
If there is no guarantee that these objects could be cleared. Assuming that GC haven't run, is it obsolete to use string.intern() itself? (In terms of the memory usage?)
Is there any way to reduce memory usage / object creation while using intern() method?
Your example is a bit odd, as it creates 1000 empty strings. If you want to get such a list with consuming minimum memory, you should use
List<String> list = Collections.nCopies(1000, "");
instead.
If we assume that there is something more sophisticated going on, not creating the same string in every iteration, well, then there is no benefit in calling intern()
. What will happen, is implementation dependent. But when calling intern()
on a string that is not in the pool, it will be just added to the pool in the best case, but in the worst case, another copy will be made and added to the pool.
At this point, we have no savings yet, but potentially created additional garbage.
Interning at this point can only save you some memory, if there are duplicates somewhere. This implies that you construct duplicate strings first, to look up their canonical instance via intern()
afterwards, so having the duplicate string in memory until garbage collected, is unavoidable. But that’s not the real problem with interning:
- in older JVMs, there was special treatment of interned string that could result in worse garbage collection performance or even running out of resources (i.e. the fixed size “PermGen” space).
- in HotSpot, the string pool holding the interned strings is a fixed size hash table, yielding hash collisions, hence, poor performance, when referencing significantly more strings than the table size.
Before Java 7, update 40, the default size was about 1,000, not even sufficient to hold all string constants for any nontrivial application without hash collisions, not to speak of manually added strings. Later versions use a default size of about 60,000, which is better, but still a fixed size that should discourage you from adding an arbitrary number of strings
- the string pool has to obey inter-thread semantics mandated by the language specification (as it is used to for string literals), hence, need to perform thread safe updates that can degrade the performance
Keep in mind that you pay the price of the disadvantages named above, even in the cases that there are no duplicates, i.e. there is no space saving. Also, the acquired reference to the canonical string has to have a much longer lifetime than the temporary object used to look it up, to have any positive effect on the memory consumption.
The latter touches your literal question. The temporary instances are reclaimed when the garbage collector runs the next time, which will be when the memory is actually needed. There is no need to worry about when this will happen, but well, yes, up to that point, acquiring a canonical reference had no positive effect, not only because the memory hasn’t been reused up to that point, but also, because the memory was not actually needed until then.
This is the place to mention the new String Deduplication feature. This does not change string instances, i.e. the identity of these objects, as that would change the semantic of the program, but change identical strings to use the same char[]
array. Since these character arrays are the biggest payload, this still may achieve great memory savings, without the performance disadvantages of using intern()
. Since this deduplication is done by the garbage collector, it will only applied to strings that survived long enough to make a difference. Also, this implies that it will not waste CPU cycles when there still is plenty of free memory.
However, there might be cases, where manual canonicalization might be justified. Imagine, we’re parsing a source code file or XML file, or importing strings from an external source (Reader
or data base) where such canonicalization will not happen by default, but duplicates may occur with a certain likelihood. If we plan to keep the data for further processing for a longer time, we might want to get rid of duplicate string instances.
In this case, one of the best approaches is to use a local map, not being subject to thread synchronization, dropping it after the process, to avoid keeping references longer than necessary, without having to use special interaction with the garbage collector. This implies that occurrences of the same strings within different data sources are not canonicalized (but still being subject to the JVM’s String Deduplication), but it’s a reasonable trade-off. By using an ordinary resizable HashMap
, we also do not have the issues of the fixed intern
table.
E.g.
static List<String> parse(CharSequence input) {
List<String> result = new ArrayList<>();
Matcher m = TOKEN_PATTERN.matcher(input);
CharBuffer cb = CharBuffer.wrap(input);
HashMap<CharSequence,String> cache = new HashMap<>();
while(m.find()) {
result.add(
cache.computeIfAbsent(cb.subSequence(m.start(), m.end()), Object::toString));
}
return result;
}
Note the use of the CharBuffer
here: it wraps the input sequence and its subSequence
method returns another wrapper with different start and end index, implementing the right equals
and hashCode
method for our HashMap
, and computeIfAbsent
will only invoke the toString
method, if the key was not present in the map before. So, unlike using intern()
, no String
instance will be created for already encountered strings, saving the most expensive aspect of it, the copying of the character arrays.
If we have a really high likelihood of duplicates, we may even save the creation of wrapper instances:
static List<String> parse(CharSequence input) {
List<String> result = new ArrayList<>();
Matcher m = TOKEN_PATTERN.matcher(input);
CharBuffer cb = CharBuffer.wrap(input);
HashMap<CharSequence,String> cache = new HashMap<>();
while(m.find()) {
cb.limit(m.end()).position(m.start());
String s = cache.get(cb);
if(s == null) {
s = cb.toString();
cache.put(CharBuffer.wrap(s), s);
}
result.add(s);
}
return result;
}
This creates only one wrapper per unique string, but also has to perform one additional hash lookup for each unique string when putting. Since the creation of a wrapper is quiet cheap, you really need a significantly large number of duplicate strings, i.e. small number of unique strings compared to the total number, to have a benefit from this trade-off.
As said, these approaches are very efficient, because they use a purely local cache that is just dropped afterwards. With this, we don’t have to deal with thread safety nor interact with the JVM or garbage collector in a special way.
You can open JMC and check for GC under Memory tab inside MBean Server of the particular JVM when it performed and how much did it cleared. Still, there is no fixed guarantee of the time when it would be called. You can initiate GC under Diagnostic Commands on a specific JVM.
Hope it helps.