Large performance gap between CPU's div instru

Since the beginning of CPUs it has been general knowledge that the integer division instruction is expensive. I went to see how bad it is today, on CPUs which have the luxury of billions of transistors. I found that the hardware idiv instruction still performs significantly worse for constant divisors than the code the JIT compiler is able to emit, which doesn't contain the idiv instruction.

To bring this out in a dedicated microbenchmark I've written the following:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@OperationsPerInvocation(MeasureDiv.ARRAY_SIZE)
@Warmup(iterations = 8, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class MeasureDiv
{
  public static final int ARRAY_SIZE = 128;
  public static final long DIVIDEND_BASE = 239520948509234807L;
  static final int DIVISOR = 10;
  final long[] input = new long[ARRAY_SIZE];

  @Setup(Level.Iteration) public void setup() {
    for (int i = 0; i < input.length; i++) {
      input[i] = DIVISOR;
    }
  }

  @Benchmark public long divVar() {
    long sum = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
      final long in = input[i];
      final long dividend = DIVIDEND_BASE + i;
      final long divisor = in;
      final long quotient = dividend / divisor;
      sum += quotient;
    }
    return sum;
  }

  @Benchmark public long divConst() {
    long sum = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
      final long in = input[i];
      final long dividend = DIVIDEND_BASE + in;
      final int divisor = DIVISOR;
      final long quotient = dividend / divisor;
      sum += quotient;
    }
    return sum;
  }
}

In a nutshell, I have two methods identical in every respect except that one (divVar) performs division by a number read off an array while the other divides by a compile-time constant. These are the results:

Benchmark            Mode  Cnt  Score   Error  Units
MeasureDiv.divConst  avgt    5  1.228 ± 0.032  ns/op
MeasureDiv.divVar    avgt    5  8.913 ± 0.192  ns/op

The performance ratio is quite extraordinary. My expectation would be that a modern Intel processor has enough real estate, and its engineers have enough interest, to implement a complex but performant division algorithm in hardware. Yet the JIT compiler beats Intel by sending it a stream of some other instructions which perform the same job, just seven times faster. If anything, dedicated microcode should be able to utilize the CPU even better than what JIT can do through the public API of assembly instructions.

How come idiv is still so much slower, what is the fundamental limitation?

One explanation that comes to mind is the hypothetical existence of a division algorithm which involves the dividend for the first time very late into the process. The JIT compiler would then have a head start because it would evaluate the first part which involves only the divisor at compile time and emit just the second part of the algorithm as runtime code. Is that hypothesis true?

As explained by user pvg through the comments, the hypothesized algorithm is indeed in existence and the best one currently known. The algorithm involves division by the same divisor in the preparatory step, so it is fundamentally irreducible as a whole. It is covered in Chapter 10 of the classic publication Hacker's Delight.