Strange performance drop of JDK8 LocalDate.toEpoch

2019-02-23 07:13发布

问题:

I was curious if we finally get a fast datetime library with JDK8. Nearly all LocalDate computations use toEpochDay so I looked at the source and the large number of divisions and branches made me curious if I could do better.

I eliminated some branching and all but one division, however the speedup is worse than expected. So my first question how is it possible that an algorithm using multiple division takes only about 30 cycles (throughput). Holger's comments seem to have answered this: Division by a small constant gets JIT-ed to multiplication. I did it manually and now I'm consistently beating the original implementation by a factor of 2.

The benchmark is pretty straightforward, just iterate though an array of random LocalDates and convert each of them toEpochDay. Despite the randomness, the results are pretty consistent. The size of the array is a parameter and my main question is where the big slowdown between 2000 and 30000 comes from. There's should be some slowdown as the data no more fit in the L1 cache, but the memory accesses of both algorithms are exactly the same (i.e., none but fetching the date from the array).

The still open question is: How it comes that the behaviour of two simple memory-access-free implementations of the same function change when iterating over an array? The original algorithm suffers a much bigger slowdown than mine.

My algorithm is probably not worth copying here, it's undocumented and about as cryptic as the original, and there's only a very rudimentary test.

回答1:

I haven't catched the reason directly, but it is certainly a benchmarking framework shortcoming. Something related to GC and per-invocation costs. I have the same performance degradation with JMH, except bench with 100 dates shows better perf than with 2000 dates, too. I've tried to create the dates array always of maximum size, and iterate just first 100, 2000, 30000 elements. In this case all versions performed equally (15.3 +- 0.3 ns on my machine).

import org.openjdk.jmh.annotations.*;

import java.time.LocalDate;
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;


@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(LocalDateBenchmark.ITERATIONS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LocalDateBenchmark {
    public static final int MAX_ITERATIONS = 1000000;
    public static final int ITERATIONS = 30000;

    private static final LocalDate MIN_DATE = LocalDate.of(1900, 1, 1);
    private static final LocalDate MAX_DATE = LocalDate.of(2100, 1, 1);
    private static final int DAYS_BETWEEN = (int) (MAX_DATE.toEpochDay() - MIN_DATE.toEpochDay());

    public LocalDate[] dates = new LocalDate[MAX_ITERATIONS];
    private Random random;

    @Setup(Level.Trial)
    public void setUpAll() {
        Random r = ThreadLocalRandom.current();
        for (int i=0; i< dates.length; ++i) {
            dates[i] = MIN_DATE.plusDays(r.nextInt(DAYS_BETWEEN));
        }
    }

    @Setup(Level.Iteration)
    public void setUpRandom() {
        random = new Random();
    }

    @GenerateMicroBenchmark
    public int timeToEpochDay(LocalDateBenchmark state) {
        int result = 0;
        LocalDate[] dates = state.dates;
        int offset = random.nextInt(MAX_ITERATIONS - ITERATIONS);
        for (int i = offset; i < offset + ITERATIONS; i++) {
            LocalDate date = dates[i];
            result += date.toEpochDay();
        }
        return result;
    }
}


回答2:

thats because there are no divisions in the algorithm. all the / 4 are replaced by shifts. and all the / 100 are actually * 0.01's. the divisions are there for readbility (hehe). im not sure if this optimization happens during bytecode emission or JIT compilation, it would be interesting to look at the class file and find out.