Is string interning really useful?

Posted 2019-02-04 05:30

I was having a conversation about strings and various languages a while back, and the topic of string interning came up. Apparently Java and the .NET framework do this automatically with all strings, as well as several scripting languages. Theoretically, it saves memory because you don't end up with multiple copies of the same string, and it saves time because string equality comparisons are a simple pointer comparison instead of an O(N) run through each character of the string.
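In Java, for instance, string literals are pooled automatically and `String.intern()` returns the canonical pooled copy, so reference comparison becomes valid between interned strings. A minimal sketch of that behavior:

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = "interning";             // literal: drawn from the pool
        String b = new String("interning"); // explicit copy: a distinct object

        System.out.println(a == b);          // false: different objects
        System.out.println(a.equals(b));     // true: character-by-character, O(N)
        System.out.println(a == b.intern()); // true: intern() yields the pooled copy
    }
}
```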

But the more I think about it, the more skeptical I grow of the concept's benefits. It seems to me that the advantages are mostly theoretical:

  • First off, to use automatic string interning, all strings must be immutable, which makes a lot of string processing tasks harder than they need to be. (And yes, I've heard all the arguments for immutability in general. That's not the point.)
  • Every time a new string is created, it has to be checked against the string interning table, which is at least an O(N) operation. (EDIT: Where N is the size of the string, not the size of the table, since this was confusing people.) So unless the ratio of string equality comparisons to new string creations is pretty high, it's unlikely that the net time saved comes out positive.
  • If the string intern table uses strong references, the strings will never get garbage collected when they're no longer needed, thus wasting memory. On the other hand, if the table uses weak references, then the string class requires some sort of finalizer to remove the string from the table, thus slowing down the GC process. (Which could be pretty significant, depending on how the string intern table is implemented. Worst case, deleting an item from a hash table can require an O(N) rebuild of the entire table under certain circumstances.)
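The trade-offs in the last two bullets can be made concrete with a hypothetical weak-reference intern table (an illustration only, not how any real JVM pool is implemented). `WeakHashMap` keys are weakly referenced, so unused strings can be collected without a per-string finalizer, but every lookup must hash the whole candidate string, which is exactly the O(N)-in-string-length cost described above:

```java
import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

// Hypothetical weak-reference interner, for illustration only.
final class WeakInterner {
    private final WeakHashMap<String, WeakReference<String>> table =
            new WeakHashMap<>();

    synchronized String intern(String s) {
        WeakReference<String> ref = table.get(s); // hashes all of s: O(length)
        String canonical = (ref != null) ? ref.get() : null;
        if (canonical != null) {
            return canonical;  // an equal string is already pooled
        }
        // Storing a weak value avoids strongly pinning the key alive.
        table.put(s, new WeakReference<>(s));
        return s;
    }
}

public class WeakInternerDemo {
    public static void main(String[] args) {
        WeakInterner interner = new WeakInterner();
        String a = interner.intern(new String("cached"));
        String b = interner.intern(new String("cached"));
        System.out.println(a == b); // true: both map to one canonical copy
    }
}
```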

This is just the result of me thinking about implementation details. Is there something I've missed? Does string interning actually provide any significant benefits in the general case?

EDIT 2: All right, apparently I was operating from a mistaken premise. The person I was talking to never pointed out that string interning was optional for newly-created strings, and in fact gave the strong impression that the opposite was true. Thanks to Jon for setting the matter straight. Another accepted answer for him.

7 Answers
何必那么认真
Reply #2 · 2019-02-04 06:21

The points you listed are all valid to a certain extent. But there are important counter-arguments.

  1. Immutability is very important, especially because strings are used as hash-map keys a lot, and a mutable key would break the map.
  2. String composition operations are very slow anyway, because you have to constantly reallocate the array containing the characters.
  3. On the other hand, substring() operations are very fast.
  4. String equality is indeed used a lot, and you're not losing anything there, because strings aren't interned automatically. In Java, if the references differ, equals() falls back to a character-by-character comparison.
  5. Clearly, using strong references for the intern table isn't a good idea. You have to live with the GC overhead.
  6. Java string handling was designed to be space-efficient, especially on constant strings and substring operations.
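Points 1 and 2 above are two sides of the same design: because strings are immutable, naive concatenation with `+=` allocates a fresh String and recopies everything on every step (O(n²) overall), and the mutable escape hatch is `StringBuilder`. A small sketch:

```java
public class ComposeDemo {
    public static void main(String[] args) {
        // Naive composition: each += allocates a brand-new String,
        // copying everything built so far.
        String slow = "";
        for (int i = 0; i < 5; i++) {
            slow += i;
        }

        // StringBuilder mutates a single growable buffer instead.
        StringBuilder fast = new StringBuilder();
        for (int i = 0; i < 5; i++) {
            fast.append(i);
        }

        System.out.println(slow);                          // 01234
        System.out.println(slow.equals(fast.toString()));  // true
    }
}
```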

On balance I'd say it is worth it in most cases and fits well with the VM-managed heap concept. I could imagine some special scenarios where it could be a real pain though.
