Why lambda IntStream.anyMatch() is 10 slower than

I was recently profiling my code and found one interesting bottleneck in it. Here is benchmark :

@BenchmarkMode(Mode.Throughput)
@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
public class Contains {

    private int[] ar = new int[] {1,2,3,4,5,6,7};

    private int val = 5;

    @Benchmark
    public boolean naive() {
        return contains(ar, val);
    }

    @Benchmark
    public boolean lambdaArrayStreamContains() {
        return Arrays.stream(ar).anyMatch(i -> i == val);
    }

    @Benchmark
    public boolean lambdaIntStreamContains() {
        return IntStream.of(ar).anyMatch(i -> i == val);
    }

    private static boolean contains(int[] ar, int value) {
        for (int arVal : ar) {
            if (arVal == value) {
                return true;
            }
        }
        return false;
    }

}

Result :

Benchmark                            Mode  Cnt       Score      Error  Units
Contains.lambdaArrayStreamContains  thrpt   10   22867.962 ± 1049.649  ops/s
Contains.lambdaIntStreamContains    thrpt   10   22983.800 ±  593.580  ops/s
Contains.naive                      thrpt   10  228002.406 ± 8591.186  ops/s

If shows that Array contains operation via lambda is 10 times slower than naive implementation with simple loop. I knew that lambdas should be a bit slower. But 10 times? Am I doing wrong lambda or this is some issue with java?

标签： java performance lambda microbenchmark

1条回答

Summer. ? 凉城

2楼-- · 2019-01-26 01:16

Your benchmark does not actually measure anyMatch performance, but rather a stream overhead. This overhead can appear significant when comparing to a very simple operation like a five-element array lookup.

The slowdown will not look so terrible if we swtich from relative to absolute numbers. Let's measure latency instead of throughput for a clearer picture. I've omitted lambdaIntStream benchmark since it works exactly the same way as lambdaArrayStream.

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  53,242 ± 2,034  ns/op
Contains.naive              avgt    5   5,876 ± 0,404  ns/op

5.8 ns is roughly 14 cycles of a 2.4 GHz CPU. The workload is so small that any extra cycle will be noticeable. So what is the overhead of stream operations?

Object allocation

Now rerun the benchmark with -prof gc profiler. It will show the amount of heap allocations:

Benchmark                                       Mode  Cnt     Score     Error   Units
Contains.lambdaArrayStream:·gc.alloc.rate.norm  avgt    5   152,000 ±   0,001    B/op
Contains.naive:·gc.alloc.rate.norm              avgt    5    ≈ 10⁻⁵              B/op

lambdaArrayStream allocates 152 bytes per iteration while naive benchmark allocates nothing. Of course, allocation is not free: there are at least 5 objects constructed to support anyMatch, and each takes several nanoseconds:

Lambda i -> i == val
IntPipeline.Head
Spliterators.IntArraySpliterator
MatchOps.MatchOp
MatchOps.MatchSink

Call stack

java.util.stream implementation is a bit complicated since it must support all combinations of stream sources, intermediate and terminal operations. If you look at the call stack of anyMatch in your benchmark, you'll see something like that:

    at bench.Contains.lambda$lambdaArrayStream$0(Contains.java:24)
    at java.util.stream.MatchOps$2MatchSink.accept(MatchOps.java:119)
    at java.util.Spliterators$IntArraySpliterator.tryAdvance(Spliterators.java:1041)
    at java.util.stream.IntPipeline.forEachWithCancel(IntPipeline.java:162)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:230)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:196)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.IntPipeline.anyMatch(IntPipeline.java:477)
    at bench.Contains.lambdaArrayStream(Contains.java:23)

Not all of these method calls can be inlined. Furthermore, JVM limits inlining to 9 levels, but here we see the deeper call stack. If we override the limit with -XX:MaxInlineLevel=20 the score will become a bit better:

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  33,294 ± 0,367  ns/op  (was 53,242)
Contains.naive              avgt    5   5,822 ± 0,207  ns/op

Loop optimizations

for iteration over an array is a trivial counted loop. JVM can apply a wide range of loop optimizatons here: loop peeling, loop unrolling etc. This does not work for a while-kind loop in forEachWithCancel method, which is used to traverse an IntStream. The effect of loop optimizations can be measured with -XX:LoopUnrollLimit=0 -XX:-UseLoopPredicate:

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  33,153 ± 0,559  ns/op
Contains.naive              avgt    5   9,853 ± 0,150  ns/op  (was 5,876)

Conclusions

There is some overhead to construct and to traverse a stream, but this is completely understood and cannot be considered a bug. I would not say the overhead is big (even 50 ns/op is not that much); however, in this particular example the overhead is dominant because of an extremely tiny workload.

0人赞添加讨论(0) 举报