I understand that Java's internal representation of String is UTF-16 (see "What is java string representation?").
Also, I know that in a UTF-16 String, each 'character' is encoded with one or two 16-bit code units.
However, when I debug the following Java code:
String hello = "Hello";
the variable hello is shown as an array of 5 bytes, 0x48, 0x65, 0x6C, 0x6C, 0x6F (72, 101, 108, 108, 111 in decimal),
which is ASCII for "Hello".
How can this be?
I took a gcore dump of a minimal Java process running this code:
    class Hi {
        public static void main(String[] args) {
            String hello = "Hello";
            try {
                // Keep the process alive long enough to take the dump.
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
I took the memory dump with gcore on Ubuntu (using jps to get the pid and passing that to gcore).
I found this: 48 65 6C 6C 6F in the dump using a hex editor, so the string is somewhere in memory as ASCII.
But I also found 48 00 65 00 6C 00 6C, which is part of the UTF-16 representation of the String.
The internal representation of String is not specified; it is an implementation detail, so you cannot rely on it. It's very likely that in JDK 9 it will be changed to use a dual encoding (Latin-1 for strings that can be encoded in Latin-1, UTF-16 for all other strings). See JEP 254 for details. This feature is already integrated into the OpenJDK master codebase, so if you are using Java 9 early-access builds, you will actually see 5 bytes.
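To illustrate the point, here is a minimal sketch showing that the public String API behaves as UTF-16 whatever storage the JVM picks, and that the two byte patterns seen in the dump match the Latin-1 and little-endian UTF-16 encodings of "Hello". The class name StringEncodingDemo and the hex helper are made up for this example. It only encodes the string through the public API and does not inspect JVM internals.

```java
import java.nio.charset.StandardCharsets;

public class StringEncodingDemo {

    // Hypothetical helper: formats bytes as space-separated upper-case hex,
    // the way a hex editor would display them.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hello = "Hello";

        // The API counts UTF-16 code units regardless of the internal storage.
        System.out.println(hello.length());        // prints 5
        System.out.println((int) hello.charAt(1)); // prints 101 (0x65, 'e')

        // One byte per character: the 5-byte pattern from the dump.
        System.out.println(hex(hello.getBytes(StandardCharsets.ISO_8859_1)));
        // prints 48 65 6C 6C 6F

        // Two bytes per character, low byte first: the second pattern from the dump.
        System.out.println(hex(hello.getBytes(StandardCharsets.UTF_16LE)));
        // prints 48 00 65 00 6C 00 6C 00 6F 00
    }
}
```

Code that sticks to length(), charAt() and getBytes(Charset) sees the same results on either storage scheme, which is why the change can be made without breaking callers.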