Avoid duplicate Strings in Java

Posted 2019-02-10 19:15

I want to ask a question about avoiding String duplicates in Java.

The context is: an XML with tags and attributes like this one:

<product id="PROD" name="My Product"...></product>

With JibX, this XML is marshalled/unmarshalled in a class like this:

public class Product {
    private String id;
    private String name;
    // constructor, getters, setters, methods and so on
}

The program is a long-running batch process, so Product objects are created, used, copied, etc.

Well, the question is: when I analysed the execution with a tool like Eclipse Memory Analyzer (MAT), I found many duplicated Strings. For example, for the id attribute, the value PROD is held by around 2000 separate String instances.

How can I avoid this situation? Other attributes in the Product class may change their value during execution, but attributes like id, name... don't change so frequently.

I have read something about the String.intern() method, but I haven't used it yet and I'm not sure it's a solution for this. Could I define the most frequent values of those attributes as static final constants in the class?

I hope I have expressed my question clearly. Any help or advice is much appreciated. Thanks in advance.

5 answers
Melony?
Reply #2 · 2019-02-10 19:33

While String.intern() could solve that problem by reducing each value to a single unique String instance, it would introduce another problem: every intern()-ed String can survive for a long time in the JVM. If the IDs vary a lot (i.e. they are not part of a limited set, but can be any value), then this can have massive negative effects in the long run.

Edit: I used to claim that intern()-ed Strings can't ever be GCed, but @nanda proved me wrong with this JavaWorld article. While this somewhat reduces the problem introduced by intern(), it doesn't remove it entirely: the pool provided by intern() can't be controlled and can have unexpected results with regard to garbage collection.

Luckily Guava provides a solution in the form of the Interner interface and its helper class Interners: using Interners.newStrongInterner() you can create an object that acts as a "pool" of unique String objects much in the same way as String.intern() does, except that the pool is bound to that instance, and if you discard the pool, its content can become eligible for garbage collection as well.
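To illustrate the idea of a pool that is bound to an instance, here is a minimal sketch that uses only the standard library. It is not Guava's implementation; StringPool and its methods are hypothetical names made up for this example, and a Guava Interner would be used in much the same way.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical instance-scoped pool; a stand-in for an interner object, not Guava's code. */
final class StringPool {
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance equal to s, registering s itself if none exists yet. */
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }
}

final class StringPoolDemo {
    public static void main(String[] args) {
        StringPool pool = new StringPool();
        // new String(...) forces two distinct instances with equal content
        String a = pool.intern(new String("PROD"));
        String b = pool.intern(new String("PROD"));
        System.out.println(a == b);      // true: both refer to the pooled instance
        System.out.println(pool.size()); // 1
    }
}
```

Once nothing references the StringPool object any more, the map and the Strings it holds can be collected like any other objects.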

放荡不羁爱自由
Reply #3 · 2019-02-10 19:33

Yes, interning is the correct solution, and you've done your homework (that is, checking with a profiler that this really is the problem).

Interning can cause problems if you store too many Strings. On older JVMs the pool lives in PermGen, which may then need to be increased (since Java 7 the pool is kept on the regular heap). Despite what some people say, interned Strings are also garbage collected, so Strings that are no longer referenced become eligible for collection.

Some supporting articles:

  1. My blog: http://blog.firdau.si/2009/01/06/java-tips-memory-optimization-for-string/
  2. Are interned Strings garbage collected?: http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
  3. Busting the 'Busting String.intern() Myths': http://kohlerm.blogspot.com/2009/01/is-javalangstringintern-really-evil.html
ら.Afraid
Reply #4 · 2019-02-10 19:34

An alternative solution:

You could try defining an <xs:enumeration/> restriction on your @id attribute (if your domain model allows such a thing). If JibX is as intelligent as JAXB or other XML-Java mapping standards, this could be mapped to a Java enum with constant literals, which can be reused heavily.

I would try that for the ID value, since it kinda looks like an enumeration to me...
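A rough sketch of what the Java side of such a mapping could look like, assuming the ids form a small closed set. ProductId and the value OTHER are made up for illustration, and whether JibX can generate or bind to such an enum is not verified here.

```java
/** Hypothetical enum of known product ids; only PROD appears in the question. */
enum ProductId {
    PROD, OTHER // OTHER is an invented second value
}

final class EnumIdDemo {
    /** Maps a raw attribute value to the shared enum constant. */
    static ProductId parse(String raw) {
        // valueOf throws IllegalArgumentException for values outside the set
        return ProductId.valueOf(raw);
    }

    public static void main(String[] args) {
        ProductId a = parse(new String("PROD"));
        ProductId b = parse(new String("PROD"));
        System.out.println(a == b); // true: enum constants are singletons
    }
}
```

Every Product then holds a reference to one shared constant instead of its own String, but any id outside the declared set is rejected.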

Reply #5 · 2019-02-10 19:38

Interning would be the right solution, if you really have a problem. Java keeps String literals, and every String on which intern() has been called, in an internal pool. When a literal is loaded or intern() is called, the JVM first checks whether an equal String is already in the pool. If so, it does not add a new entry but returns the reference to the pooled String object. Strings created at runtime in other ways (new String(...), concatenation, parsing) are not pooled automatically.

There are two ways to control this behaviour:

String interned = aString.intern(); // returns a reference to an interned String
String notInterned = new String(aString); // creates a new String instance (guaranteed)

So maybe the libraries really do create new instances for all XML attribute values. This is possible, and you won't be able to change it.


intern has a global effect. An interned String is immediately available "for any object" (this view doesn't really make sense, but it may help to understand it).

So, let's say we have a line in class Foo, method foolish:

String s = "ABCD";

String literals are interned automatically. The JVM checks whether "ABCD" is already in the pool; if not, "ABCD" is stored in the pool. The JVM then assigns a reference to the interned String to s.

Now, maybe in another class Bar, in method barbar:

String t = "AB"+"CD";

Here "AB" + "CD" is a compile-time constant expression, so the compiler folds it into the single literal "ABCD"; "AB" and "CD" are not concatenated at runtime. When the JVM loads that literal it finds that "ABCD" is already interned and assigns the reference to that same String to t.
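The difference between a constant concatenation and one computed at runtime can be checked with a small test. This is only a sketch; ConcatDemo and its method names are invented for the example.

```java
final class ConcatDemo {
    static final String S = "ABCD"; // literal, taken from the pool

    /** "AB" + "CD" is a compile-time constant, so it is the pooled "ABCD". */
    static boolean constantConcatIsPooled() {
        String t = "AB" + "CD";
        return t == S;
    }

    /** ab is a non-final variable, so the concatenation happens at runtime. */
    static boolean runtimeConcatIsPooled() {
        String ab = "AB";
        String u = ab + "CD"; // new String instance, not taken from the pool
        return u == S;
    }

    /** intern() maps the runtime result back to the pooled instance. */
    static boolean internedRuntimeConcatIsPooled() {
        String ab = "AB";
        String u = ab + "CD";
        return u.intern() == S;
    }

    public static void main(String[] args) {
        System.out.println(constantConcatIsPooled());        // true
        System.out.println(runtimeConcatIsPooled());         // false
        System.out.println(internedRuntimeConcatIsPooled()); // true
    }
}
```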


Calling "PROD".intern() may work or fail. Yes, it will intern the String "PROD". But there's a chance that JibX really creates new Strings for attribute values with

String value = new String(getAttributeValue(attribute));

In that case, value will not have a reference to an interned String (even if "PROD" is in the pool) but a reference to a new String instance on the heap.
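If the binding does hand over fresh instances, one way to collapse them is to intern the value at the point where it enters the object. The sketch below assumes the value arrives through a setter; whether JibX actually calls a setter (rather than writing the field directly) is an assumption, and the reduced Product class here only mirrors the one from the question.

```java
/** Reduced version of the question's Product class with an interning setter. */
final class Product {
    private String id;

    void setId(String id) {
        // intern() returns the pooled instance that equals id
        this.id = (id == null) ? null : id.intern();
    }

    String getId() {
        return id;
    }
}

final class ProductInternDemo {
    public static void main(String[] args) {
        Product p1 = new Product();
        Product p2 = new Product();
        // simulate a library that creates a new String per attribute value
        p1.setId(new String("PROD"));
        p2.setId(new String("PROD"));
        System.out.println(p1.getId() == p2.getId()); // true
    }
}
```

The temporary instances passed to setId become garbage right away, so only one "PROD" String stays reachable from all Product objects.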

And, to the other question in your comment: this happens at runtime only. Compiling simply creates class files; the String pool is a data structure on the object heap that is used by the JVM executing the application.

做个烂人
Reply #6 · 2019-02-10 19:47

As everyone knows, String objects can be created in two ways: by using literals or through the new operator.

If you use a literal like String test = "Sample"; then this will be cached in String object pool. So interning is not required here as by default the string object will be cached.

But if you create a string object like String test = new String("Sample"); then this string object will not be added to the string pool. So here we need to use String test = new String("Sample").intern(); to forcefully push the string object to the string cache.

So it is always advisable to use string literals rather than the new operator.

So in your case private static final String id = "PROD"; is the right solution.
