What can explain the huge performance penalty of w

While investigating the subtler consequences of generational garbage collectors on application performance, I have hit a quite staggering discrepancy in the performance of a very basic operation – a simple write to a heap location – with respect to whether the value written is primitive or a reference.

The microbenchmark

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Writing
{
  static final int TARGET_SIZE = 1024;

  static final    int[] primitiveArray = new    int[TARGET_SIZE];
  static final Object[] referenceArray = new Object[TARGET_SIZE];

  int val = 1;
  @GenerateMicroBenchmark
  public void fillPrimitiveArray() {
    final int primitiveValue = val++;
    for (int i = 0; i < TARGET_SIZE; i++)
      primitiveArray[i] = primitiveValue;
  }

  @GenerateMicroBenchmark
  public void fillReferenceArray() {
    final Object referenceValue = new Object();
    for (int i = 0; i < TARGET_SIZE; i++)
      referenceArray[i] = referenceValue;
  }
}

The results

Benchmark              Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray     avgt   1      6    1       87.891        1.610  nsec/op
fillReferenceArray     avgt   1      6    1      640.287        8.368  nsec/op

Since the whole loop is almost 8 times slower, the write itself is probably more than 10 times slower. What could possibly explain such a slowdown?

The speed of writing out the primitive array is more than 10 writes per nanosecond. Perhaps I should ask the flip-side of my question: what makes primitive writing so fast? (BTW I've checked, the times scale linearly with array size.)

Note that this is all single-threaded; specifying @Threads(2) will increase both measurements, but the ratio will be similar.

A bit of background: the card table and the associated write barrier

An object in the Young Generation could happen to be reachable only from an object in the Old Generation. To avoid collecting live objects, the YG collector must know about any references that were written to the Old Generation area since the last YG collection. This is achieved with a sort of "dirty flag table", called the card table, which has one flag for each block of 512 bytes of heap.

The "ugly" part of the scheme comes when we realize that each and every write of a reference must be accompanied by a card table invariant-maintaining piece of code: the location in the card table which guards the address being written to must be marked as dirty. This piece of code is termed the write barrier.

In specific machine code, this looks as follows:

lea   edx, [edi+ebp*4+0x10]   ; calculate the heap location to write
mov   [edx], ebx              ; write the value to the heap location
shr   edx, 9                  ; calculate the offset into the card table
mov   [ecx+edx], ah           ; mark the card table entry as dirty

And this is all it takes for the same high-level operation when the value written is primitive:

mov   [edx+ebx*4+0x10], ebp

The write barrier appears to contribute "just" one more write, but my measurements show that it causes an order-of-magnitude slowdown. I can't explain this.

`UseCondCardMark` just makes it worse

There is a quite obscure JVM flag which is supposed to avoid the card table write if the entry is already marked dirty. This is important primarily in some degenerate cases where a lot of card table writing causes false sharing between threads via CPU caches. Anyway, I tried with that flag on:

with  -XX:+UseCondCardMark:
Benchmark              Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray     avgt   1      6    1       89.913        3.586  nsec/op
fillReferenceArray     avgt   1      6    1     1504.123       12.130  nsec/op

Quoting the authoritative answer provided by Vladimir Kozlov at `hotspot-compiler-dev` mailing list:

Hi Marko,

For primitive arrays we use handwritten assembler code which use XMM registers as vectors for initialization. For object arrays we did not optimize it because it is not common case. We can improve it similar to what we did for arracopy but we decided leave it for now.

Regards,
Vladimir

I have also wondered why the optimized code is not inlined, and got that answer as well:

The code is not small, so we decided to not inline it. Look on MacroAssembler::generate_fill() in macroAssembler_x86.cpp:

http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/file/54f0c207dc35/src/cpu/x86/vm/macroAssembler_x86.cpp

My original answer:

I missed an important bit in the machine code, apparently because I was looking at the On-Stack Replacement version of the compiled method instead of the one used for subsequent calls. It turns out that HotSpot was able to prove that my loop amounts to what a call to Arrays.fill would have done and replaced the entire loop with a call instruction to such code. I can't see that function's code, but it probably uses every possible trick, such as MMX instructions, to fill a block of memory with the same 32-bit value.

This gave me the idea to measure the actual Arrays.fill calls. I got more surprise:

Benchmark                  Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray         avgt   1      5    2      155.343        1.318  nsec/op
fillReferenceArray         avgt   1      5    2      682.975       17.990  nsec/op
loopFillPrimitiveArray     avgt   1      5    2      156.114        0.523  nsec/op
loopFillReferenceArray     avgt   1      5    2      682.209        7.047  nsec/op

The results with a loop and with a call to fill are identical. If anything, this is even more confusing than the results which motivated the question. I would have at least expected fill to benefit from the same optimization ideas regardless of value type.