How could this Java code be sped up?

Posted 2019-05-14 20:29

I am trying to benchmark how fast Java can do a simple task: read a huge file into memory and then perform some meaningless calculations on the data. All types of optimizations count, whether it's rewriting the code differently, using a different JVM, tricking the JIT, and so on.

The input file is a list of 500 million 32-bit integer pairs, each pair on its own line with the two numbers separated by a comma. Like this:

44439,5023
33140,22257
...

This file takes 5.5GB on my machine. The program can't use more than 8GB of RAM and can use only a single thread.

package speedracer;

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class Main
{
    public static void main(String[] args)
    {
        int[] list = new int[1000000000];

        long start1 = System.nanoTime();
        parse(list);
        long end1 = System.nanoTime();

        System.out.println("Parsing took: " + (end1 - start1) / 1000000000.0);

        int rs = 0;
        long start2 = System.nanoTime();

        for (int k = 0; k < list.length; k++) {
            rs = calc(list[k++], list[k++], list[k++], list[k]);
        }

        long end2 = System.nanoTime();

        System.out.println(rs);
        System.out.println("Calculations took: " + (end2 - start2) / 1000000000.0);
    }

    public static int calc(final int a1, final int a2, final int b1, final int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;

        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }

        return c1;
    }

    public static void parse(int[] list)
    {
        FileChannel fc = null;
        int i = 0;

        MappedByteBuffer byteBuffer;

        try {
            fc = new FileInputStream("in.txt").getChannel();

            long size = fc.size();
            long allocated = 0;
            long allocate = 0;

            while (size > allocated) {

               if ((size - allocated) > Integer.MAX_VALUE) {
                   allocate = Integer.MAX_VALUE;
               } else {
                   allocate = size - allocated;
               }

               byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, allocated, allocate);
               byteBuffer.clear();

               allocated += allocate;

               int number = 0;

               while (byteBuffer.hasRemaining()) {
                   char val = (char) byteBuffer.get();
                   if (val == '\n' || val == ',') {
                        list[i] = number;

                        number = 0;
                        i++;
                   } else {
                       number = number * 10 + (val - '0');
                   }
                }
            }

            fc.close();

        } catch (Exception e) {
            System.err.println("Parsing error: " + e);
        }
    }
}

I've tried everything I could think of: different readers, OpenJDK 6, Sun JDK 6, and Sun JDK 7. I had to do some ugly parsing since MappedByteBuffer cannot map more than 2GB of memory at once. I'm running:

   Linux AS292 2.6.38-11-generic #48-Ubuntu SMP 
   Fri Jul 29 19:02:55 UTC 2011 
   x86_64 GNU/Linux. Ubuntu 11.04. 
   CPU: Intel(R) Core(TM) i5-2410M @ 2.30GHz.

Currently my results are: parsing 26.50s, calculations 11.27s. I'm competing against a similar C++ benchmark which does the I/O in roughly the same time, but whose calculations take only 4.5s. My main objective is to reduce the calculation time by any means possible. Any ideas?

Update: It seems the main speed improvement could come from what is called auto-vectorization. I was able to find some hints that the current Sun JIT only does "some vectorization", but I can't really confirm it. It would be great to find a JVM or JIT with better auto-vectorization support.
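
One way to check, at least on HotSpot, is to watch which methods the JIT compiles and, assuming the hsdis disassembler plugin is installed, dump the generated machine code:

   java -XX:+PrintCompilation speedracer.Main
   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly speedracer.Main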

7 Answers
We Are One
#2 · 2019-05-14 21:01

I am trying to benchmark how fast Java can do a simple task: read a huge file into memory and then perform some meaningless calculations on the data.

If the task is to do a meaningless calculation, then the best optimization is to not do the calculation.

If what you are really trying to do here is to figure out whether there is a general technique to make a computation go faster, then I think you are barking up the wrong tree. There is no such technique. What you learn from optimizing a meaningless calculation is not likely to apply to other (hopefully meaningful) calculations.

If the calculation is not meaningless, and the aim is to make the whole program go faster, you've probably already reached the point where optimization is a waste of time:

  • Current (Java) - 26.50s + 11.27s = ~38 seconds
  • Goal (C++) - ~26.5s + 4.50 = ~31 seconds
  • Percentage speedup - less than 20%.

A speedup of less than 20% for a ~40 second computation is probably not worth the effort. It is cheaper to get the user to twiddle his thumbs for those extra 7 seconds.


This is also telling you something interesting: in this scenario, it doesn't make much difference in relative terms whether you use C++ or Java. The overall performance of the program is dominated by a phase in which C++ and Java are comparable.

唯我独甜
#3 · 2019-05-14 21:06

First of all, -O3 enables:

-finline-functions
-ftree-vectorize

among others...

So it looks like it actually might be vectorizing.

EDIT: This has been confirmed (see comments). The C++ version is indeed being vectorized by the compiler. With vectorization disabled, the C++ version actually runs a bit slower than the Java version.

Assuming the JIT does not vectorize the loop, it may be difficult/impossible for the Java version to match the speed of the C++ version.


Now, if I were a smart C/C++ compiler, here's how I would arrange that loop (on x64):

int c1 = (a1 + a2) ^ a2;
int c2 = (b1 - b2) << 4;

int tmp0 = c1;
int tmp1 = 0;
int tmp2 = 0;
int tmp3 = 0;

int z0 = 0;
int z1 = 1;
int z2 = 2;
int z3 = 3;

do {
    tmp0 ^= z0 + c2;
    tmp1 ^= z1 + c2;
    tmp2 ^= z2 + c2;
    tmp3 ^= z3 + c2;
    z0 += 4;
    z1 += 4;
    z2 += 4;
    z3 += 4;
} while (z0 < 100);

tmp0 ^= tmp1;
tmp2 ^= tmp3;

tmp0 ^= tmp2;

return tmp0;

Note that this loop is completely vectorizable.

Even better, I would completely unroll this loop. These are things that a C/C++ compiler will do. But now the question is: will the JIT do it?
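
For anyone who wants to try this from the Java side, here is a rough sketch of the same transformation written as a hand-unrolled variant of the question's calc() (the method name is made up; the math is equivalent to the original, since XOR is associative and commutative):

public static int calcUnrolled(final int a1, final int a2, final int b1, final int b2)
{
    int c2 = (b1 - b2) << 4;

    // Four independent accumulators; tmp0 starts out holding the original c1.
    int tmp0 = (a1 + a2) ^ a2;
    int tmp1 = 0, tmp2 = 0, tmp3 = 0;

    // 100 is divisible by 4, so each accumulator sees exactly 25 values of z.
    for (int z = 0; z < 100; z += 4) {
        tmp0 ^= z + c2;
        tmp1 ^= (z + 1) + c2;
        tmp2 ^= (z + 2) + c2;
        tmp3 ^= (z + 3) + c2;
    }

    // Folding the lanes back together gives the same result as the sequential loop.
    return tmp0 ^ tmp1 ^ tmp2 ^ tmp3;
}

Whether HotSpot then turns the four independent lanes into SIMD instructions is exactly the open question.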

爷、活的狠高调
#4 · 2019-05-14 21:06

The MappedByteBuffer is only gaining you about 20% in I/O performance, and it comes at an enormous memory cost: if it causes swapping, the cure is worse than the disease.

I would use a BufferedReader around a FileReader, and maybe a Scanner around that to get the integers, or at least Integer.parseInt(), which is a lot more likely to have been warmed up by HotSpot than your own radix conversion code.
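
A rough sketch of that approach, reusing the question's in.txt and int[] layout (the buffer size is arbitrary, and try-with-resources needs Java 7):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public static void parse(int[] list)
{
    try (BufferedReader reader = new BufferedReader(new FileReader("in.txt"), 1 << 16)) {
        String line;
        int i = 0;
        while ((line = reader.readLine()) != null) {
            int comma = line.indexOf(',');
            // Integer.parseInt is exercised all over the JDK, so HotSpot has
            // almost certainly compiled it long before this loop warms up.
            list[i++] = Integer.parseInt(line.substring(0, comma));
            list[i++] = Integer.parseInt(line.substring(comma + 1));
        }
    } catch (IOException e) {
        System.err.println("Parsing error: " + e);
    }
}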

别忘想泡老子
#5 · 2019-05-14 21:11

Interesting question. :-) This is probably more of a comment since I won't really answer your question, but it's too long for the comment box.

Micro-benchmarking in Java is tricky because the JIT can go nuts with optimizations. But this particular code tricks the JIT in such a way that it somehow cannot perform its normal optimizations.

Normally, this code would run in O(1) time because your main loop has no effect on anything:

    for (int k = 0; k < list.length; k++) {
        rs = calc(list[k++], list[k++], list[k++], list[k]);
    }

Note that the final result of rs doesn't really depend on running all iterations of the loop; just the last one. You can calculate the final value of "k" for the loop without actually running it. Normally the JIT would notice that and turn your loop into a single assignment, if it's able to detect that the function being called (calc) has no side effects (which it doesn't).

But, somehow, this statement in the calc() function messes up the JIT:

        c1 ^= z + c2;

Somehow that adds too much complexity for the JIT to decide that all this code in the end doesn't change anything and that the original loop can be optimized out.

If you change that particular statement to something even more pointless, like:

        c1 = z + c2;

Then the JIT picks things up and optimizes your loops away. Try it out. :-)

I tried locally with a much smaller data set: with the "^=" version the calculations took ~1.6s, while with the "=" version they took 0.007 seconds (in other words, it optimized the loop away).

As I said, not really a response, but I thought this might be interesting.
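
If you do want the timing to reflect the actual work, a common fix is to make the printed result depend on every call, for example by accumulating the return values (a sketch of the idea; the checksum itself is just as meaningless as the original rs):

    long checksum = 0;
    for (int k = 0; k + 3 < list.length; k += 4) {
        // Every iteration contributes to checksum, so the JIT cannot drop the calls.
        checksum += calc(list[k], list[k + 1], list[k + 2], list[k + 3]);
    }
    System.out.println(checksum);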

来,给爷笑一个
#6 · 2019-05-14 21:18

Did you try "inlining" parse() and calc(), i.e. putting all the code in main()?

狗以群分
#7 · 2019-05-14 21:19

What do you get if you move the few lines of your calc function directly inside your list iteration?
I know it's not very clean, but you save the method call overhead.

[...]
    for (int k = 0; k < list.length; k++) {
        int a1 = list[k++];
        int a2 = list[k++];
        int b1 = list[k++];
        int b2 = list[k];

        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;

        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }

        rs = c1;
    }