Zero-garbage large String deserialization in Java

Posted 2019-01-23 08:54

Question:

I am looking for a way to deserialize a String from a byte[] in Java with as little garbage produced as possible. Because I am creating my own serializer and de-serializer, I have complete freedom to implement any solution on the server-side (i.e. when serializing data), and on the client-side (i.e. when de-serializing data).

I have managed to serialize a String efficiently, without incurring any garbage overhead, by iterating through the String's chars (String.charAt(i)) and converting each char (a 16-bit value) into two 8-bit values. There is a nice debate regarding this here. An alternative is to use reflection to access the String's underlying char[] directly, but this is outside the scope of the problem.
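A minimal sketch of the charAt-based approach described above. The class name CharSerializer, the method name write, and the big-endian byte order are assumptions made for illustration; the question does not specify them.

```java
/** Hypothetical helper: writes each char of a String as two bytes, big-endian. */
final class CharSerializer {

    /**
     * Encodes s into dst starting at offset and returns the next free offset.
     * The caller supplies (and can reuse) dst, so this method allocates nothing.
     */
    static int write(String s, byte[] dst, int offset) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            dst[offset++] = (byte) (c >>> 8); // high byte
            dst[offset++] = (byte) c;         // low byte
        }
        return offset;
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[64]; // reusable buffer, sized by the caller
        int end = write("h\u00e9llo", buffer, 0);
        System.out.println(end + " bytes written");
    }
}
```

Because the destination buffer is passed in, it can be pooled or reused across calls, which is what keeps the serialization side garbage-free.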

However, it seems impossible for me to deserialize the byte[] without creating the char[] twice, which seems, well, weird.

The procedure:

  1. Create char[]
  2. Iterate through byte[] and fill-in the char[]
  3. Create String with String(char[]) constructor
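The three steps above can be sketched as follows. The class name CharDeserializer and the big-endian two-bytes-per-char layout are assumptions for illustration only.

```java
/** Hypothetical helper illustrating the three-step procedure. */
final class CharDeserializer {

    /** Decodes charCount big-endian 16-bit chars from src starting at offset. */
    static String read(byte[] src, int offset, int charCount) {
        char[] chars = new char[charCount];       // step 1: first allocation
        for (int i = 0; i < charCount; i++) {     // step 2: fill the char[]
            int hi = src[offset++] & 0xFF;
            int lo = src[offset++] & 0xFF;
            chars[i] = (char) ((hi << 8) | lo);
        }
        return new String(chars);                 // step 3: constructor copies chars
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x48, 0x00, 0x69};   // "Hi" as big-endian 16-bit chars
        System.out.println(read(data, 0, 2));
    }
}
```

The temporary chars array becomes garbage as soon as read returns, which is the second allocation the question is trying to avoid.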

Because of Java's String immutability rules, the constructor copies the char[], creating 2x GC overhead. I can always use mechanisms to circumvent this (Unsafe String allocation + Reflection to set the char[] instance), but I just wanted to ask if there are any consequences to this other than me breaking every convention on String's immutability.

Of course, the wisest response to this would be "come on, stop doing this and trust the GC; the original char[] will be extremely short-lived and G1 will get rid of it almost immediately", which actually makes sense if the char[] is smaller than half of the G1 region size. If it is larger, the char[] is allocated directly as a humongous object (i.e. it bypasses the young generation and is placed in dedicated humongous regions). Such objects are extremely hard to be efficiently garbage collected in G1. That's why each allocation matters.
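To get a feel for when a char[] crosses the half-region threshold, here is a rough estimate. The class name HumongousEstimate, the 16-byte array header, the 8-byte alignment, and the treatment of the exactly-half case are all assumptions; real values depend on the JVM and its settings.

```java
/** Hypothetical estimator for whether a char[] is likely a G1 humongous object. */
final class HumongousEstimate {

    /** Assumed array header size in bytes (typical for 64-bit HotSpot; not guaranteed). */
    static final long ASSUMED_ARRAY_HEADER_BYTES = 16;

    /** Rough heap footprint of a char[] of the given length, rounded up to 8 bytes. */
    static long charArrayBytes(int length) {
        long raw = ASSUMED_ARRAY_HEADER_BYTES + 2L * length;
        return (raw + 7) & ~7L;
    }

    /** True if the array would exceed half of a G1 region of regionBytes. */
    static boolean likelyHumongous(int length, long regionBytes) {
        return charArrayBytes(length) > regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = 1L << 20; // hypothetical 1 MiB region (-XX:G1HeapRegionSize=1m)
        System.out.println(likelyHumongous(100_000, region));
        System.out.println(likelyHumongous(300_000, region));
    }
}
```

With a 1 MiB region, a String of roughly 262,000 chars or more would fall into the humongous category under these assumptions; a larger region size raises that limit accordingly.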

Any ideas on how to tackle the issue?

Many thanks.

Answer 1:

"Such objects are extremely hard to be efficiently garbage collected in G1."

This may no longer be true, but you will have to evaluate it for your own application. JDK bugs 8027959 and 8048179 introduce new mechanisms for collecting short-lived humongous objects. According to the bug flags, you might have to run JDK versions ≥8u40 and ≥8u60, respectively, to reap their benefits.

Experimental option of interest (experimental options generally have to be unlocked with -XX:+UnlockExperimentalVMOptions):

-XX:+G1ReclaimDeadHumongousObjectsAtYoungGC

Tracing:

-XX:+G1TraceReclaimDeadHumongousObjectsAtYoungGC
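To confirm from inside the application that such an option was actually passed on the command line, one possibility is to inspect the JVM's input arguments. This is only a sketch: the class name JvmFlagCheck and the method hasFlag are hypothetical, and it only detects flags given explicitly in the -XX:+Name form, not defaults.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper that checks whether a boolean -XX flag was enabled explicitly. */
final class JvmFlagCheck {

    /** True if jvmArgs contains "-XX:+flagName" exactly. */
    static boolean hasFlag(List<String> jvmArgs, String flagName) {
        String wanted = "-XX:+" + flagName;
        for (String arg : jvmArgs) {
            if (arg.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Reclaim flag set explicitly: "
                + hasFlag(jvmArgs, "G1ReclaimDeadHumongousObjectsAtYoungGC"));
    }
}
```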

For further advice and questions regarding those features I would recommend hitting the hotspot-gc-use mailing list.



Answer 2:

I have found a solution, although it is useless if you have an unmanaged environment.

The java.lang.String class has a package-private constructor String(char[] value, boolean share).

Source:

/*
* Package private constructor which shares value array for speed.
* this constructor is always expected to be called with share==true.
* a separate constructor is needed because we already have a public
* String(char[]) constructor that makes a copy of the given char[].
*/
String(char[] value, boolean share) {
    // assert share : "unshared not supported";
    this.value = value;
}

This constructor is used extensively within the JDK itself, e.g. in Integer.toString(), Long.toString(), String.concat(String), String.replace(char, char), String.valueOf(char).

The solution (or hack, whatever you want to call it) is to move your class into the java.lang package so that it can access the package-private constructor. This will not sit well with the security manager, but that can be circumvented.
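A different way to reach the same constructor, without relocating a class into java.lang, is reflection. The sketch below is not the approach this answer describes. The class name SharedStringFactory is hypothetical, and the code assumes the constructor shown above exists in the running JDK. If it is missing or inaccessible, the sketch falls back to the ordinary copying constructor.

```java
import java.lang.reflect.Constructor;

/** Hypothetical factory that tries to build a String without copying the char[]. */
final class SharedStringFactory {

    /** The package-private String(char[], boolean) constructor, or null if unavailable. */
    private static final Constructor<String> SHARING_CTOR = lookup();

    private static Constructor<String> lookup() {
        try {
            Constructor<String> c =
                    String.class.getDeclaredConstructor(char[].class, boolean.class);
            c.setAccessible(true);
            return c;
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Constructor missing or access denied (other JDK version, security manager).
            return null;
        }
    }

    /** Wraps chars without copying when possible; the caller must never modify chars afterwards. */
    static String wrap(char[] chars) {
        if (SHARING_CTOR != null) {
            try {
                return SHARING_CTOR.newInstance(chars, true);
            } catch (ReflectiveOperationException | RuntimeException e) {
                // Fall through to the copying constructor.
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(wrap(new char[] {'o', 'k'}));
        System.out.println(SHARING_CTOR != null ? "shared" : "copied");
    }
}
```

When the shared path is taken, the String and the caller's array are the same memory, so any later write to the array silently changes the String. That is exactly the immutability breakage the question asks about.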



Answer 3:

I found a working solution using a simple "secret" internal class that ships with Java:

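// Assumed imports (not in the original): org.apache.commons.lang3.StringUtils, sun.misc.SharedSecrets (JDK 8 internal API)
String longString = StringUtils.repeat("bla", 1000000);
char[] longArray = longString.toCharArray();
// Wraps longArray without copying it; the array must not be modified afterwards.
String fastCopiedString = SharedSecrets.getJavaLangAccess().newStringUnsafe(longArray);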