Why are only literal strings saved in the intern pool by default?
Example from MSDN:
String s1 = "MyTest";
String s2 = new StringBuilder().Append("My").Append("Test").ToString();
String s3 = String.Intern(s2);
Console.WriteLine("s1 == '{0}'", s1);
Console.WriteLine("s2 == '{0}'", s2);
Console.WriteLine("s3 == '{0}'", s3);
Console.WriteLine("Is s2 the same reference as s1?: {0}", (Object)s2==(Object)s1);
Console.WriteLine("Is s3 the same reference as s1?: {0}", (Object)s3==(Object)s1);
/*
This example produces the following results:
s1 == 'MyTest'
s2 == 'MyTest'
s3 == 'MyTest'
Is s2 the same reference as s1?: False
Is s3 the same reference as s1?: True
*/
The short answer: interning literal strings is cheap at runtime and saves memory. Interning non-literal strings is expensive at runtime and saves only a tiny amount of memory in exchange for making the common cases much slower.
The benefit of the interning-strings-at-runtime "optimization" does not pay for its cost, so it is not actually an optimization. Interning literal strings is cheap, so there the benefit does pay for the cost.
I answer your question in more detail here:
http://blogs.msdn.com/b/ericlippert/archive/2009/09/28/string-interning-and-string-empty.aspx
The language designers decided that the benefit of interning every intermediate string value was not worth the performance cost. Interning garbage-collectible strings requires a single global weak map, which can become a bottleneck when you have large numbers of threads.
Interning strings would provide almost no benefit in most string usage scenarios, even if one had a zero-cost weak-reference interning pool (the ideal interning implementation). In order for string interning to offer any benefit, it is necessary that multiple references to coincidentally-equal strings be kept for a reasonably "long" time.
Consider the following two programs:
- Input 100,000 lines from a text file, each containing some arbitrary text, and then 100,000 five-digit numbers. Regard each number read in as a zero-based index into the list of 100,000 lines that were read in, and output the corresponding line to the output.
- Input 100,000 lines from a text file, outputting every line that contains the character sequence "fnord".
For the first program, depending upon the contents of the text file, string interning might generate almost a 50,000:1 savings in memory (if the file contained 100,000 identical long lines of text) or might represent a total waste (if all 100,000 lines are different). In the absence of string interning, an input file with 100,000 identical lines would cause 100,000 live instances of the same string to exist simultaneously. With string interning, the number of live instances could be reduced to two. Of course, there's no way a compiler can even try to guess whether the input file is apt to contain 100,000 identical lines, 100,000 different lines, or something in between.
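If the programmer does know that the input is likely to contain many repeated lines, the deduplication for the first program can be done by hand. The following is only a minimal sketch of that idea, not something from the framework: the names LinePool and Deduplicate are made up for illustration, and it uses a local Dictionary rather than String.Intern so that the pool can be garbage-collected together with the lines.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the .NET Framework.
public static class LinePool
{
    // Returns a list in which equal lines share one canonical instance.
    // The pool is a local variable, so it becomes garbage once the
    // method returns and nothing else references it.
    public static List<string> Deduplicate(IEnumerable<string> lines)
    {
        var pool = new Dictionary<string, string>(StringComparer.Ordinal);
        var result = new List<string>();
        foreach (string line in lines)
        {
            string canonical;
            if (!pool.TryGetValue(line, out canonical))
            {
                // First time this text is seen: it becomes the canonical copy.
                pool[line] = line;
                canonical = line;
            }
            result.Add(canonical);
        }
        return result;
    }

    public static void Main()
    {
        // Build equal strings at run time so that they are distinct instances.
        var input = new List<string>();
        for (int i = 0; i < 5; i++)
        {
            input.Add(new string('x', 40));
        }

        List<string> lines = Deduplicate(input);
        Console.WriteLine(object.ReferenceEquals(lines[0], lines[4])); // True
    }
}
```

Whether this pays off still depends entirely on the data: with 100,000 different lines, the dictionary is pure overhead.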
For the second program, it's unlikely that even an ideal string-interning implementation would offer much benefit. Even if all 100,000 lines of the input file happened to be identical, interning couldn't save much memory. The effect of interning isn't to prevent the creation of redundant string instances, but rather to allow redundant string instances to be identified and discarded. Since each line can be discarded once it has been examined and either output or not, the only thing interning could buy would be the (theoretical) ability to discard redundant string instances (very) slightly sooner than would otherwise be possible.
There may be benefits in some cases to caching certain 'intermediate' string results, but that's a task that's really best left to the programmer. For example, I have a program which needs to convert a lot of bytes to two-digit hex strings. To facilitate that, I have an array of 256 strings which hold the string equivalents of values from 00 to FF. I know that, on average, each string in that array will be used, at minimum, hundreds or thousands of times, so caching those strings is a huge win. On the other hand, the strings can only be cached because I know what they represent. I may know that, for any n from 0 to 255, String.Format("{0:X2}", n) will always yield the same value, but I wouldn't expect a compiler to know that.
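A lookup table of that kind might look roughly like the sketch below. The names HexTable and ToHex are hypothetical, chosen only for this example, and the table is assumed to be built once when the class is first used.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration.
public static class HexTable
{
    // 256 precomputed strings, "00" through "FF", built once.
    private static readonly string[] Table = BuildTable();

    private static string[] BuildTable()
    {
        var table = new string[256];
        for (int i = 0; i < table.Length; i++)
        {
            table[i] = i.ToString("X2");
        }
        return table;
    }

    // Returns the cached two-digit hex string; no string is allocated per call.
    public static string ToHex(byte value)
    {
        return Table[value];
    }

    public static void Main()
    {
        byte[] data = { 0x00, 0x0A, 0xFF };
        var sb = new StringBuilder();
        foreach (byte b in data)
        {
            sb.Append(ToHex(b));
        }
        Console.WriteLine(sb.ToString()); // 000AFF
    }
}
```

Every call to ToHex for the same byte returns the same string instance, which is the effect interning would have had, but at the cost of one array lookup instead of a hash-table probe.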