I've seen many primitive examples describing how String intern()'ing works, but I have yet to see a real-life use-case that would benefit from it.
The only situation that I can dream up is having a web service that receives a considerable amount of requests, each being very similar in nature due to a rigid schema. By intern()'ing the request field names in this case, memory consumption can be significantly reduced.
Can anyone provide an example of using intern() in a production environment with great success? Maybe an example of it in a popular open source offering?
Edit: I am referring to manual interning, not the guaranteed interning of String literals, etc.
Interning can be very beneficial if you have N strings that can take on only K different values, where N far exceeds K. Now, instead of storing N strings in memory, you will only be storing up to K.

For example, you may have an ID type which consists of 5 digits. Thus, there can only be 10^5 different values. Suppose you're now parsing a large document that has many references/cross-references to ID values. Let's say this document has 10^9 references in total (obviously some references are repeated in other parts of the document).

So N = 10^9 and K = 10^5 in this case. If you are not interning the strings, you will be storing 10^9 strings in memory, where lots of those strings are equals (by the pigeonhole principle). If you intern() the ID string you get when you're parsing the document, and you don't keep any reference to the uninterned strings you read from the document (so they can be garbage collected), then you will never need to store more than 10^5 strings in memory.

We had a production system that processes literally millions of pieces of data at a time, many of which have string fields. We should have been interning strings, but there was a bug which meant we were not. By fixing the bug we avoided having to do a very costly (at least six figures, possibly seven) server upgrade.
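To make the arithmetic concrete, here is a minimal sketch of that parsing pattern (the Reference record and the shape of the input are hypothetical, not from the answer above): each parsed ID is interned before being stored, so however many references the document contains, at most K distinct String instances stay reachable.

```java
import java.util.ArrayList;
import java.util.List;

public class IdParser {

    // Hypothetical record type: each entry keeps a reference to an ID string.
    record Reference(String id) {}

    // Parse a stream of raw ID tokens. Without intern(), the returned list
    // would pin one String instance per reference (N instances); with it,
    // at most one instance per distinct 5-digit value (K instances) survives.
    static List<Reference> parse(Iterable<String> rawIds) {
        List<Reference> refs = new ArrayList<>();
        for (String raw : rawIds) {
            // intern() returns the canonical copy from the JVM's string table;
            // the freshly parsed 'raw' instance becomes garbage-collectable.
            refs.add(new Reference(raw.intern()));
        }
        return refs;
    }
}
```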
Examples where interning will be beneficial involve large numbers of strings that are likely to be duplicated many times and to remain reachable for a long time.
Typical examples involve splitting / parsing a text into symbols (words, identifiers, URIs) and then attaching those symbols to long-lived data structures. XML processing, programming language compilation and RDF / OWL triple stores spring to mind as applications where interning is likely to be beneficial.
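Since the question asks about manual interning specifically, it's worth noting that parsers like these often keep their own canonicalizing map rather than calling String.intern(). A minimal sketch of that pattern, assuming a ConcurrentHashMap-backed pool (the SymbolPool name is mine):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// A per-parser intern pool: unlike String.intern(), the whole pool becomes
// garbage once the parser is done with it, so nothing leaks JVM-wide.
public class SymbolPool {

    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for 'symbol', storing it on first sight.
    public String intern(String symbol) {
        String canonical = pool.putIfAbsent(symbol, symbol);
        return canonical != null ? canonical : symbol;
    }
}
```

The trade-off versus intern() is an extra map lookup per symbol, in exchange for a pool whose lifetime you control.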
But interning is not without its problems, especially if it turns out that those assumptions are not correct. Among other costs, interning potentially increases GC overheads by increasing the number of objects that need to be traced and copied, and by increasing the number of weak references that need to be dealt with. This increase in overheads has to be balanced against the decrease in GC overheads that results from effective interning.
Not a complete answer, but additional food for thought (found here):
Never, ever, use intern on user-supplied data, as that can cause denial of service attacks (since intern()ed strings are never freed). You can do validation on the user-supplied strings, but then again you've done most of the work needed for intern().
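Whether interned strings can ever be reclaimed depends on the JVM (modern HotSpot holds its string table entries weakly), but a common way to sidestep the question entirely, assuming Guava is available, is a weak interner: duplicates are still collapsed, yet canonical strings that nothing else references can be garbage collected, which blunts the unbounded-growth concern. A sketch:

```java
import com.google.common.collect.Interner;
import com.google.common.collect.Interners;

public class RequestFieldCanonicalizer {

    // Retains a weak reference to each interned string, so canonical strings
    // that are no longer referenced elsewhere can be reclaimed by the GC.
    private static final Interner<String> FIELDS = Interners.newWeakInterner();

    public static String canonicalize(String fieldName) {
        return FIELDS.intern(fieldName);
    }
}
```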