Why direct memory 'array' is slower to cle

I've set up a JMH benchmark to measure which is faster: Arrays.fill with null, System.arraycopy from an array of nulls, zeroing a DirectByteBuffer, or zeroing an unsafe memory block, in an attempt to answer this question. Let's put aside that zeroing directly allocated memory is a rare case, and discuss the results of my benchmark.

Here's the JMH benchmark snippet (full code available via a gist). It includes the unsafe.setMemory case suggested by @apangin in the original post, as well as byteBuffer.put(byte[], offset, length) and longBuffer.put(long[], offset, length) as suggested by @jan-schaefer:

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayFill() {
    Arrays.fill(objectHolderForFill, null);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayCopy() {
    System.arraycopy(nullsArray, 0, objectHolderForArrayCopy, 0, objectHolderForArrayCopy.length);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directByteBufferManualLoop() {
    while (referenceHolderByteBuffer.hasRemaining()) {
        referenceHolderByteBuffer.putLong(0);
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directByteBufferBatch() {
    referenceHolderByteBuffer.put(nullBytes, 0, nullBytes.length);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directLongBufferManualLoop() {
    while (referenceHolderLongBuffer.hasRemaining()) {
        referenceHolderLongBuffer.put(0L);
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directLongBufferBatch() {
    referenceHolderLongBuffer.put(nullLongs, 0, nullLongs.length);
}


@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void unsafeArrayManualLoop() {
    long addr = referenceHolderUnsafe;
    long pos = 0;
    for (int i = 0; i < size; i++) {
        unsafe.putLong(addr + pos, 0L);
        pos += 1 << 3;
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void unsafeArraySetMemory() {
    unsafe.setMemory(referenceHolderUnsafe, size*8, (byte) 0);
}

Here's what I got (Java 1.8, JMH 1.13, Core i3-6100U 2.30 GHz, Win10):

100 elements
Benchmark                                       Mode      Cnt   Score   Error    Units
ArrayNullFillBench.arrayCopy                   sample  5234029  39,518 ± 0,991    ns/op
ArrayNullFillBench.directByteBufferBatch       sample  6271334  43,646 ± 1,523    ns/op
ArrayNullFillBench.directLongBufferBatch       sample  4615974  45,252 ± 2,352    ns/op
ArrayNullFillBench.arrayFill                   sample  4745406  76,997 ± 3,547    ns/op
ArrayNullFillBench.unsafeArrayManualLoop       sample  5980381  78,811 ± 2,870    ns/op
ArrayNullFillBench.unsafeArraySetMemory        sample  5985884  85,062 ± 2,096    ns/op
ArrayNullFillBench.directLongBufferManualLoop  sample  4697023  116,242 ± 2,579   ns/op WOW
ArrayNullFillBench.directByteBufferManualLoop  sample  7504629  208,440 ± 10,651  ns/op WOW

I excluded all the loop implementations (except arrayFill, kept for scale) from the further tests.

1000 elements
Benchmark                                 Mode      Cnt    Score   Error    Units
ArrayNullFillBench.arrayCopy              sample  6780681  184,516 ± 14,036  ns/op
ArrayNullFillBench.directLongBufferBatch  sample  4018778  293,325 ± 4,074   ns/op
ArrayNullFillBench.directByteBufferBatch  sample  4063969  313,171 ± 4,861   ns/op
ArrayNullFillBench.arrayFill              sample  6862928  518,886 ± 6,372   ns/op

10000 elements
Benchmark                                 Mode      Cnt     Score   Error    Units
ArrayNullFillBench.arrayCopy              sample  2551851  2024,543 ± 12,533  ns/op
ArrayNullFillBench.directLongBufferBatch  sample  2958517  4469,210 ± 10,376  ns/op
ArrayNullFillBench.directByteBufferBatch  sample  2892258  4526,945 ± 33,443  ns/op
ArrayNullFillBench.arrayFill              sample  5689507  5028,592 ± 9,074   ns/op

Could you please help me understand the following:

1. Why is `unsafeArraySetMemory` a bit slower than `unsafeArrayManualLoop`?
2. Why is directByteBuffer 2.5X-5X slower than the others?

Answer 1:

Why is unsafeArraySetMemory a bit slower than unsafeArrayManualLoop?

My guess is that it is not as well optimised for setting an exact multiple of longs. It has to check whether the length you passed is not quite a multiple of 8.

Why is directByteBuffer an order of magnitude slower than the others?

An order of magnitude would be around 10x; it is about 2.5x slower. It has to bounds-check every access, and it updates a field (the buffer's position) instead of a local variable.
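
To make the "field instead of a local variable" point concrete, here is a minimal sketch that contrasts relative puts (which advance the buffer's position on every call) with absolute puts driven by a local index. The class and method names are made up for illustration, and it assumes the buffer's limit is a multiple of 8. It does not claim that one variant is faster; that would have to be measured with JMH.

```java
import java.nio.ByteBuffer;

final class DirectBufferZeroing {

    /** Relative puts: every call is bounds-checked and advances the buffer's position. */
    static void zeroRelative(ByteBuffer buf) {
        // Reset position to 0 and limit to capacity. Without this, a second
        // call would find nothing remaining and silently do no work.
        buf.clear();
        while (buf.remaining() >= Long.BYTES) {
            buf.putLong(0L);
        }
    }

    /** Absolute puts: the index is a local variable; the position is untouched. */
    static void zeroAbsolute(ByteBuffer buf) {
        // Still bounds-checked on every call, but no position update.
        int end = buf.limit() - (Long.BYTES - 1);
        for (int i = 0; i < end; i += Long.BYTES) {
            buf.putLong(i, 0L);
        }
    }

    /** Helper for checking the result: true if every byte below the limit is 0. */
    static boolean isAllZero(ByteBuffer buf) {
        for (int i = 0; i < buf.limit(); i++) {
            if (buf.get(i) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(100 * Long.BYTES);
        zeroRelative(buf);
        zeroAbsolute(buf);
        System.out.println(isAllZero(buf));
    }
}
```

Note the `clear()` call in the relative variant: a `while (hasRemaining())` loop only does work while the position is below the limit, so the buffer has to be reset somewhere before each benchmark invocation.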

NOTE: I have found that the JVM doesn't always unroll loops that use Unsafe. You might try unrolling the loop yourself to see if it helps.
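
A rough sketch of what manual unrolling could look like, with four stores per iteration plus a tail loop. The class name, the helper names and the unroll factor of 4 are my own choices for illustration, not something taken from the question's gist. It assumes `sun.misc.Unsafe` is reachable via the usual `theUnsafe` reflection trick, which newer JDKs warn about or may restrict. Whether it actually helps needs to be measured.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class UnrolledZeroing {
    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            // Assumption: the JDK still exposes the singleton in this field.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible", e);
        }
    }

    /** Zeroes count longs starting at addr, four stores per loop iteration. */
    static void zeroLongsUnrolled(long addr, int count) {
        int i = 0;
        int limit = count - 3; // a full group of four fits only while i < limit
        for (; i < limit; i += 4) {
            long p = addr + ((long) i << 3);
            UNSAFE.putLong(p, 0L);
            UNSAFE.putLong(p + 8, 0L);
            UNSAFE.putLong(p + 16, 0L);
            UNSAFE.putLong(p + 24, 0L);
        }
        // Tail: the remaining 0 to 3 longs.
        for (; i < count; i++) {
            UNSAFE.putLong(addr + ((long) i << 3), 0L);
        }
    }

    /** Allocates count longs (count > 0), dirties them, zeroes them, verifies. */
    static boolean selfCheck(int count) {
        long bytes = (long) count << 3;
        long addr = UNSAFE.allocateMemory(bytes);
        try {
            UNSAFE.setMemory(addr, bytes, (byte) 0x7F);
            zeroLongsUnrolled(addr, count);
            for (int i = 0; i < count; i++) {
                if (UNSAFE.getLong(addr + ((long) i << 3)) != 0L) {
                    return false;
                }
            }
            return true;
        } finally {
            UNSAFE.freeMemory(addr);
        }
    }

    public static void main(String[] args) {
        System.out.println(selfCheck(100));
    }
}
```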

NOTE: Native code can use 128-bit XMM instructions, and is doing so increasingly, which is why the copy might be so fast. Access to XMM instructions may come in Java 10.

Answer 2:

The comparison is a bit unfair. You are using a single operation in the Arrays.fill and System.arraycopy cases, but a loop with multiple invocations of putLong in the DirectByteBuffer case. If you look at the implementation of putLong, you will see that there is a lot going on there, such as checking accessibility. You should try a batch operation like put(long[] src, int srcOffset, int longCount) and see what happens.
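
As a sketch of the batch approach (class and method names are hypothetical, and the chunking loop is my addition so that the source array may be shorter than the buffer), zeroing a direct LongBuffer from a preallocated array of zeros could look like this:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

final class BatchZeroing {

    /** Zeroes the whole buffer by bulk-copying from a preallocated zero array. */
    static void zeroBatch(LongBuffer target, long[] zeros) {
        if (zeros.length == 0) {
            throw new IllegalArgumentException("zeros must not be empty");
        }
        target.clear(); // position = 0, limit = capacity
        while (target.hasRemaining()) {
            // One bulk put per chunk instead of one put per element.
            int n = Math.min(zeros.length, target.remaining());
            target.put(zeros, 0, n);
        }
    }

    public static void main(String[] args) {
        LongBuffer buf = ByteBuffer.allocateDirect(1000 * Long.BYTES)
                .order(ByteOrder.nativeOrder())
                .asLongBuffer();
        zeroBatch(buf, new long[1000]); // same size: a single bulk put
        System.out.println(buf.position());
    }
}
```

When the zero array is as long as the buffer, as in the benchmark's `nullLongs`, the loop runs exactly once and the call reduces to a single `put(long[], int, int)`.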