Strange JIT pessimization of a loop idiom

While analyzing the results of a recent question here, I encountered a quite peculiar phenomenon: apparently an extra layer of HotSpot's JIT-optimization actually slows down execution on my machine.

Here is the code I have used for the measurement:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.ARRAY_SIZE)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Measure
{
  public static final int ARRAY_SIZE = 1024;
  private final int[] array = new int[ARRAY_SIZE];

  @Setup public void setup() {
    final Random random = new Random();
    for (int i = 0; i < ARRAY_SIZE; ++i) {
      final int x = random.nextInt();
      array[i] = x == 0? 1 : x;
    }
  }

  @GenerateMicroBenchmark public int normalIndex() {
    final int[] array = this.array;
    int result = 0;
    for (int i = 0; i < array.length; i++) {
      final int j = i & array.length-1;
      final int entry = array[i];
      result ^= entry + j;
    }
    return result;
  }

  @GenerateMicroBenchmark public int maskedIndex() {
    final int[] array = this.array;
    int result = 0;
    for (int i = 0; i < array.length; i++) {
      final int j = i & array.length-1;
      final int entry = array[j];
      result ^= entry + i;
    }
    return result;
  }

  @GenerateMicroBenchmark public int normalWithExitPoint() {
    final int[] array = this.array;
    int result = 0;
    for (int i = 0; i < array.length; i++) {
      final int j = i & array.length-1;
      final int entry = array[i];
      result ^= entry + j;
      if (entry == 0) break;
    }
    return result;
  }

  @GenerateMicroBenchmark public int maskedWithExitPoint() {
    final int[] array = this.array;
    int result = 0;
    for (int i = 0; i < array.length; i++) {
      final int j = i & array.length-1;
      final int entry = array[j];
      result ^= entry + i;
      if (entry == 0) break;
    }
    return result;
  }


}

The code is quite subtle so let me point out the important bits:

the "normal index" variants use the straight variable i for array index. HotSpot can easily determine the range of i throughout the loop and eliminate the array bounds check;
the "masked index" variants index by j, which is actually equal to i, but that fact is "hidden" from HotSpot through the AND-masking operation;
the "with exit point" variants introduce an explict loop exit point. The importance of this will be explained below.

Loop unrolling and reordering

Observe that the bounds check figures in two important ways:

it has runtime overhead associated with it (a comparison followed by a conditional branch);
it constitutes a loop exit point which can interrupt the loop on any step. This turns out to have important consequences for the applicable JIT optimizations.

By inspecting the machine code emitted for the above four methods, I have noticed the following:

in all cases the loop is unrolled;
in the case of normalIndex, which is distinguished as the only one having no premature loop exit points, the operations of all the unrolled steps are reordered such that first all the array fetches are performed, followed by XOR-ing all the values into the accumulator.

Expected and actual measurement results

Now we can classify the four methods according to the discussed features:

normalIndex has no bounds checks and no loop exit points;
normalWithExitPoint has no bounds checks and 1 exit point;
maskedIndex has 1 bounds check and 1 exit point;
maskedWithExitPoint has 1 bounds check and 2 exit points.

The obvious expectation is that the above list should present the methods in descending order of performance; however, these are my actual results:

Benchmark               Mode   Samples         Mean   Mean error    Units
normalIndex             avgt        20        0.946        0.010    ns/op
normalWithExitPoint     avgt        20        0.807        0.010    ns/op
maskedIndex             avgt        20        0.803        0.007    ns/op
maskedWithExitPoint     avgt        20        1.007        0.009    ns/op

normalWithExitPoint and maskedIndex are identical modulo measurement error, even though only the latter has the bounds check;
the greatest anomaly is observed on normalIndex which should have been the fastest, but is significantly slower than normalWithExitPoint, identical to it in every respect save for having one more line of code, the one which introduces the exit point.

Since normalIndex is the only method which has the extra reordering "optimization" applied to it, the conclusion is that this is the cause of the slowdown.

I am testing on:

Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode) (Java 7 Update 40)
OS X Version 10.9.1
2.66 GHz Intel Core i7

I have successfully reproduced the results on Java 8 EA b118 as well.

My question:

Is the above phenomenon reproducible on other, similar machines? From the question mentioned at the beginning I already have a hint that at least some machines do not reproduce it, so another result from the same CPU would be very interesting.

Update 1: more measurements inspired by maaartinus's findings

I have gathered the following table which correlates execution times with the -XX:LoopUnrollLimit command-line argument. Here I focus only on two variants, with and without the if (entry == 0) break; line:

LoopUnrollLimit:   14 15 18 19 22 23 60
withExitPoint:     96 95 95 79 80 80 69   1/100 ns
withoutExitPoint:  94 64 64 63 64 77 75   1/100 ns

The following sudden changes can be observed:

on the transition from 14 to 15, the withoutExitPoint variant receives a beneficial LCM¹ transformation and is significantly sped up. Due to the loop-unrolling limit, all loaded values fit into registers;
on 18->19, the withExitPoint variant receives a speedup, lesser than the above;
on 22->23, the withoutExitPoint variant is slowed down. At this point I see the spill-over into stack locations, as described in maaartinus's answer, starting to happen.

The default loopUnrollLimit for my setup is 60, so I present its results in the last column.

¹ LCM = Local Code Motion. It is the transformation which causes all array access to happen on the top, followed by the processing of the loaded values.

Update 2: this is actually a known, reported problem

https://bugs.openjdk.java.net/browse/JDK-7101232

Appendix: The unrolled and reordered loop of `normalIndex` in machine code

0x00000001044a37c0: mov    ecx,eax
0x00000001044a37c2: and    ecx,esi            ;*iand
                                              ; - org.sample.Measure::normalIndex@20 (line 44)
0x00000001044a37c4: mov    rbp,QWORD PTR [rsp+0x28]  ;*iload_3
                                              ; - org.sample.Measure::normalIndex@15 (line 44)
0x00000001044a37c9: add    ecx,DWORD PTR [rbp+rsi*4+0x10]
0x00000001044a37cd: xor    ecx,r8d
0x00000001044a37d0: mov    DWORD PTR [rsp],ecx
0x00000001044a37d3: mov    r10d,esi
0x00000001044a37d6: add    r10d,0xf
0x00000001044a37da: and    r10d,eax
0x00000001044a37dd: mov    r8d,esi
0x00000001044a37e0: add    r8d,0x7
0x00000001044a37e4: and    r8d,eax
0x00000001044a37e7: mov    DWORD PTR [rsp+0x4],r8d
0x00000001044a37ec: mov    r11d,esi
0x00000001044a37ef: add    r11d,0x6
0x00000001044a37f3: and    r11d,eax
0x00000001044a37f6: mov    DWORD PTR [rsp+0x8],r11d
0x00000001044a37fb: mov    r8d,esi
0x00000001044a37fe: add    r8d,0x5
0x00000001044a3802: and    r8d,eax
0x00000001044a3805: mov    DWORD PTR [rsp+0xc],r8d
0x00000001044a380a: mov    r11d,esi
0x00000001044a380d: inc    r11d
0x00000001044a3810: and    r11d,eax
0x00000001044a3813: mov    DWORD PTR [rsp+0x10],r11d
0x00000001044a3818: mov    r8d,esi
0x00000001044a381b: add    r8d,0x2
0x00000001044a381f: and    r8d,eax
0x00000001044a3822: mov    DWORD PTR [rsp+0x14],r8d
0x00000001044a3827: mov    r11d,esi
0x00000001044a382a: add    r11d,0x3
0x00000001044a382e: and    r11d,eax
0x00000001044a3831: mov    r9d,esi
0x00000001044a3834: add    r9d,0x4
0x00000001044a3838: and    r9d,eax
0x00000001044a383b: mov    r8d,esi
0x00000001044a383e: add    r8d,0x8
0x00000001044a3842: and    r8d,eax
0x00000001044a3845: mov    DWORD PTR [rsp+0x18],r8d
0x00000001044a384a: mov    r8d,esi
0x00000001044a384d: add    r8d,0x9
0x00000001044a3851: and    r8d,eax
0x00000001044a3854: mov    ebx,esi
0x00000001044a3856: add    ebx,0xa
0x00000001044a3859: and    ebx,eax
0x00000001044a385b: mov    ecx,esi
0x00000001044a385d: add    ecx,0xb
0x00000001044a3860: and    ecx,eax
0x00000001044a3862: mov    edx,esi
0x00000001044a3864: add    edx,0xc
0x00000001044a3867: and    edx,eax
0x00000001044a3869: mov    edi,esi
0x00000001044a386b: add    edi,0xd
0x00000001044a386e: and    edi,eax
0x00000001044a3870: mov    r13d,esi
0x00000001044a3873: add    r13d,0xe
0x00000001044a3877: and    r13d,eax
0x00000001044a387a: movsxd r14,esi
0x00000001044a387d: add    r10d,DWORD PTR [rbp+r14*4+0x4c]
0x00000001044a3882: mov    DWORD PTR [rsp+0x24],r10d
0x00000001044a3887: mov    QWORD PTR [rsp+0x28],rbp
0x00000001044a388c: mov    ebp,DWORD PTR [rsp+0x4]
0x00000001044a3890: mov    r10,QWORD PTR [rsp+0x28]
0x00000001044a3895: add    ebp,DWORD PTR [r10+r14*4+0x2c]
0x00000001044a389a: mov    DWORD PTR [rsp+0x4],ebp
0x00000001044a389e: mov    r10d,DWORD PTR [rsp+0x8]
0x00000001044a38a3: mov    rbp,QWORD PTR [rsp+0x28]
0x00000001044a38a8: add    r10d,DWORD PTR [rbp+r14*4+0x28]
0x00000001044a38ad: mov    DWORD PTR [rsp+0x8],r10d
0x00000001044a38b2: mov    r10d,DWORD PTR [rsp+0xc]
0x00000001044a38b7: add    r10d,DWORD PTR [rbp+r14*4+0x24]
0x00000001044a38bc: mov    DWORD PTR [rsp+0xc],r10d
0x00000001044a38c1: mov    r10d,DWORD PTR [rsp+0x10]
0x00000001044a38c6: add    r10d,DWORD PTR [rbp+r14*4+0x14]
0x00000001044a38cb: mov    DWORD PTR [rsp+0x10],r10d
0x00000001044a38d0: mov    r10d,DWORD PTR [rsp+0x14]
0x00000001044a38d5: add    r10d,DWORD PTR [rbp+r14*4+0x18]
0x00000001044a38da: mov    DWORD PTR [rsp+0x14],r10d
0x00000001044a38df: add    r13d,DWORD PTR [rbp+r14*4+0x48]
0x00000001044a38e4: add    r11d,DWORD PTR [rbp+r14*4+0x1c]
0x00000001044a38e9: add    r9d,DWORD PTR [rbp+r14*4+0x20]
0x00000001044a38ee: mov    r10d,DWORD PTR [rsp+0x18]
0x00000001044a38f3: add    r10d,DWORD PTR [rbp+r14*4+0x30]
0x00000001044a38f8: mov    DWORD PTR [rsp+0x18],r10d
0x00000001044a38fd: add    r8d,DWORD PTR [rbp+r14*4+0x34]
0x00000001044a3902: add    ebx,DWORD PTR [rbp+r14*4+0x38]
0x00000001044a3907: add    ecx,DWORD PTR [rbp+r14*4+0x3c]
0x00000001044a390c: add    edx,DWORD PTR [rbp+r14*4+0x40]
0x00000001044a3911: add    edi,DWORD PTR [rbp+r14*4+0x44]
0x00000001044a3916: mov    r10d,DWORD PTR [rsp+0x10]
0x00000001044a391b: xor    r10d,DWORD PTR [rsp]
0x00000001044a391f: mov    ebp,DWORD PTR [rsp+0x14]
0x00000001044a3923: xor    ebp,r10d
0x00000001044a3926: xor    r11d,ebp
0x00000001044a3929: xor    r9d,r11d
0x00000001044a392c: xor    r9d,DWORD PTR [rsp+0xc]
0x00000001044a3931: xor    r9d,DWORD PTR [rsp+0x8]
0x00000001044a3936: xor    r9d,DWORD PTR [rsp+0x4]
0x00000001044a393b: mov    r10d,DWORD PTR [rsp+0x18]
0x00000001044a3940: xor    r10d,r9d
0x00000001044a3943: xor    r8d,r10d
0x00000001044a3946: xor    ebx,r8d
0x00000001044a3949: xor    ecx,ebx
0x00000001044a394b: xor    edx,ecx
0x00000001044a394d: xor    edi,edx
0x00000001044a394f: xor    r13d,edi
0x00000001044a3952: mov    r8d,DWORD PTR [rsp+0x24]
0x00000001044a3957: xor    r8d,r13d           ;*ixor
                                              ; - org.sample.Measure::normalIndex@34 (line 46)
0x00000001044a395a: add    esi,0x10           ;*iinc
                                              ; - org.sample.Measure::normalIndex@36 (line 43)
0x00000001044a395d: cmp    esi,DWORD PTR [rsp+0x20]
0x00000001044a3961: jl     0x00000001044a37c0  ;*if_icmpge
                                              ; - org.sample.Measure::normalIndex@12 (line 43)

The reason why the JITC tries to group everything together is rather unclear to me. AFAIK there are (were?) architectures for which grouping of two loads leads to a better performance (some early Pentium, I think).

As JITC knows the hot spots, it can inline much more aggressively than an ahead-of-time compiler, so it does it 16 times in this case. I can't see any clear advantage thereof here, except for making the looping relatively cheaper. I also doubt that there's any architecture profiting from grouping 16 loads together.

The code computes 16 temporary values, one per iteration of

int j = i & array.length-1;
int entry = array[i];
int tmp = entry + j;
result ^= tmp;

Each computation is pretty trivial, one AND, one LOAD, and one ADD. The values are to be mapped to registers, but there aren't enough of them. So the values have to be stored and loaded later.

This happens for 7 out of the 16 registers and increases the costs significantly.

Update

I'm not very sure about verifying this by using -XX:LoopUnrollLimit:

LoopUnrollLimit Benchmark   Mean   Mean error    Units

 8 ..normalIndex           0.902        0.004    ns/op
 8 ..normalWithExitPoint   0.913        0.005    ns/op
 8 ..maskedIndex           0.918        0.006    ns/op
 8 ..maskedWithExitPoint   0.996        0.008    ns/op

16 ..normalIndex           0.769        0.003    ns/op
16 ..normalWithExitPoint   0.930        0.004    ns/op
16 ..maskedIndex           0.937        0.004    ns/op
16 ..maskedWithExitPoint   1.012        0.003    ns/op

32 ..normalIndex           0.814        0.003    ns/op
32 ..normalWithExitPoint   0.816        0.005    ns/op
32 ..maskedIndex           0.838        0.003    ns/op
32 ..maskedWithExitPoint   0.978        0.002    ns/op

 - ..normalIndex           0.830        0.002    ns/op
 - ..normalWithExitPoint   0.683        0.002    ns/op
 - ..maskedIndex           0.791        0.005    ns/op
 - ..maskedWithExitPoint   0.908        0.003    ns/op

The limit of 16 makes normalIndex be the fastest variant, which indicates that I was right with the "overallocation penalty". Bur according to Marko, the generated assembly changes with the unroll limit also in other aspects, so things are more complicated.