JVM, the constant pool, the heap and the addresses

Posted 2019-09-01 01:55

Question:

If I create a new object in Jasmin assembly and then store it, I do it with the instruction astore, since it's a reference:

    new Object
    dup
    invokespecial.....
    astore_3 ; store the object reference in local variable 3

Now, if I want to save a string from the constant pool, I would load it with ldc and then save it with astore as well:

    ldc "Great string"
    astore_3 ; save the reference to the actual string in the constant pool

Now, are these addresses of the same form and the same number of bytes? Since I use the same instructions to load and to store these items, doesn't the JVM have to be able to distinguish between addresses that belong to the constant pool and addresses in the heap?

Upon inspecting the bytecode, it seems that the actual address in the constant pool in my case is just a 1-byte index (I guess a main reference to the constant pool is kept somewhere as well). I know that it is a reference to some UTF8 data in the constant pool, but is that where the actual string lies, or is it just a reference to an array of bytes someplace else? I haven't been able to inspect the address of the "new Object" in the heap. Basically, I need to work out how these two memory areas can use the same form of instructions, and how the JVM decides whether an address is an offset in the constant pool or an object in the heap.

Answer 1:

The bytecode interpreted by the JVM is not necessarily the same bytecode written in the .class file. Many JVMs perform so-called bytecode rewriting at different stages of execution.

So does the HotSpot JVM. When a class is initialized, HotSpot rewrites ldc bytecodes referring to String entries in the constant pool into the JVM-specific fast_aldc bytecode, which refers to objects (i.e. java.lang.String instances) in the CP cache. When such a fast_aldc bytecode is executed for the first time, the JVM resolves the constant pool entry, creates a String in the Java heap and populates the CP cache with the reference to this String. On further executions of the same bytecode, the JVM immediately gets the reference from the CP cache and pushes it onto the Java stack.
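The resolve-once-then-cache behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not HotSpot's actual data structures: the class name `CpCacheSketch`, its fields and the method `loadConstant` are all hypothetical, and plain UTF-8 stands in for the modified UTF-8 used in real class files.

```java
import java.nio.charset.StandardCharsets;

// Toy model of lazy constant resolution. All names here are hypothetical;
// this is not how HotSpot lays out its constant pool cache.
public class CpCacheSketch {
    // Raw bytes, standing in for CONSTANT_Utf8 data from the class file.
    private final byte[][] utf8Entries;
    // One cache slot per entry; null means "not resolved yet".
    private final String[] resolved;
    // Counts how many String objects were actually created.
    public int resolutions = 0;

    public CpCacheSketch(String... constants) {
        utf8Entries = new byte[constants.length][];
        for (int i = 0; i < constants.length; i++) {
            utf8Entries[i] = constants[i].getBytes(StandardCharsets.UTF_8);
        }
        resolved = new String[constants.length];
    }

    // Rough analogue of executing the rewritten ldc: the result is always
    // an ordinary reference to a String object on the heap.
    public String loadConstant(int index) {
        String ref = resolved[index];
        if (ref == null) {
            // First execution: build the String and remember the reference.
            ref = new String(utf8Entries[index], StandardCharsets.UTF_8);
            resolved[index] = ref;
            resolutions++;
        }
        return ref; // Later executions take this fast path.
    }

    public static void main(String[] args) {
        CpCacheSketch cache = new CpCacheSketch("Great string");
        String first = cache.loadConstant(0);
        String second = cache.loadConstant(0);
        System.out.println(first == second);   // prints "true"
        System.out.println(cache.resolutions); // prints "1"
    }
}
```

The point of the sketch is that the caller never sees the raw bytes or the index: whatever comes back from `loadConstant` is the same kind of reference that `new` would produce.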

After the interpreter executes the ldc bytecode (or its rewritten form), the top of the stack contains a valid reference to an object in the Java heap. The new bytecode produces the same kind of reference, so there is no need to distinguish between reference types.

That's how the interpreter works. Of course, after a method gets JIT-compiled, there are no more bytecodes, constant pool references etc. All of these are just abstractions. Just a model.
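Seen from Java source, this uniformity can be checked with a small sketch. The class and method names below are made up for illustration; the only facts relied on are that a string literal is pushed with ldc and that the Java language guarantees identical literals refer to the same String instance.

```java
public class ReferenceKinds {
    // javac compiles this to new / dup / invokespecial / areturn.
    static Object fromNew() {
        return new Object();
    }

    // javac compiles this to ldc / areturn.
    static Object fromLdc() {
        return "Great string";
    }

    public static void main(String[] args) {
        // Both results fit in the same kind of slot: a plain object reference.
        Object[] slots = { fromNew(), fromLdc() };
        for (Object o : slots) {
            System.out.println(o.getClass().getName());
        }
        // prints "java.lang.Object" then "java.lang.String"

        // The literal always yields the same heap object.
        System.out.println(fromLdc() == fromLdc()); // prints "true"
    }
}
```

Nothing in the calling code needs to know which bytecode produced a given reference, which is why astore and aload work for both.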



Answer 2:

First off, the entire bytecode format is just an abstraction provided by the VM. It does not necessarily have any resemblance to the actual representation of the code or memory at runtime.

Second, the constant pool is a table addressed by 16-bit indexes. Index 0 is not a valid entry and the entry count field is itself 16 bits wide, so valid indexes run from 1 up to at most 65,534. Since indexing the constant pool with a small index and a category 1 type is such a common task, there is a special shorthand instruction for it - ldc.

The ldc instruction uses a single-byte index, so it is only usable for the first 255 entries. If you want to access entries above that, you need to use the two-byte form, ldc_w. The situation is similar to other shorthand instructions, such as aload_3 vs aload 3 vs wide aload 3.
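As a rough sketch of how an assembler might choose between the two forms, the helper below encodes a constant load for a given pool index. The opcode values (0x12 for ldc, 0x13 for ldc_w) come from the JVM specification; the class name `LdcEncoder` and the method `encodeLoad` are hypothetical, and long/double constants (which need ldc2_w) are deliberately left out.

```java
public class LdcEncoder {
    static final int LDC = 0x12;   // one-byte constant pool index
    static final int LDC_W = 0x13; // two-byte constant pool index, big-endian

    // Hypothetical helper: returns the instruction bytes that load the
    // category 1 constant at cpIndex (int, float, String, ...).
    static byte[] encodeLoad(int cpIndex) {
        if (cpIndex < 1 || cpIndex > 65534) {
            throw new IllegalArgumentException("invalid constant pool index: " + cpIndex);
        }
        if (cpIndex <= 0xFF) {
            // Short form: the index fits in a single unsigned byte.
            return new byte[] { (byte) LDC, (byte) cpIndex };
        }
        // Wide form: high byte first, then low byte.
        return new byte[] { (byte) LDC_W, (byte) (cpIndex >>> 8), (byte) cpIndex };
    }

    public static void main(String[] args) {
        System.out.println(encodeLoad(3).length);   // prints "2"
        System.out.println(encodeLoad(300).length); // prints "3"
    }
}
```

For index 300 (0x012C) the result is the three bytes 0x13, 0x01, 0x2C, whereas index 3 needs only 0x12, 0x03.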

And again, that's all an abstraction. In practice the VM will convert the constant pool to a more friendly internal format and may compile actual pointers to its runtime location into the code. But that's just one possible implementation.