Java slower with big heap

2020-07-18 07:03发布

问题:

I have a Java program that operates on a (large) graph. Thus, it uses a significant amount of heap space (~50GB, which is about 25% of the physical memory on the host machine). At one point, the program (repeatedly) picks one node from the graph and does some computation with it. For some nodes, this computation takes much longer than anticipated (30-60 minutes, instead of an expected few seconds). In order to profile these opertations to find out what takes so much time, I have created a test program that creates only a very small part of the large graph and then runs the same operation on one of the nodes that took very long to compute in the original program. Thus, the test program obviously only uses very little heap space, compared to the original program.

It turns out that an operation that took 48 minutes in the original program can be done in 9 seconds in the test program. This really confuses me. The first thought might be that the larger program spends a lot of time on garbage collection. So I turned on the verbose mode of the VM's garbage collector. According to that, no full garbage collections are performed during the 48 minutes, and only about 20 collections in the young generation, which each take less than 1 second.

So my questions is what else could there be that explains such a huge difference in timing? I don't know much about how Java internally organizes the heap. Is there something that takes significantly longer for a large heap with a large number of live objects? Could it be that object allocation takes much longer in such a setting, because it takes longer to find an adequate place in the heap? Or does the VM do any internal reorganization of the heap that might take a lot of time (besides garbage collection, obviously).

I am using Oracle JDK 1.7, if that's of any importance.

回答1:

While bigger memory might mean bigger problems, I'd say there's nothing (except the GC which you've excluded) what could extend 9 seconds to 48 minutes (factor 320).

A big heap makes seemingly worse spatial locality possible, but I don't think it matters. I disagree with Tim's answer w.r.t. "having to leave the cache for everything".

There's also the TLB which a cache for the virtual address translation, which could cause some problems with very large memory. But again, not factor 320.

I don't think there's anything in the JVM which could cause such problems.

The only reason I can imagine is that you have some swap space which gets used - despite the fact that you have enough physical memory. Even slight swapping can be the cause for a huge slowdown. Make sure it's off (and possibly check swappiness).



回答2:

Even when things are in memory you have multiple levels of caching of data on modern CPUs. Every time you leave the cache to fetch data the slower that will go. Having 50GB of ram could well mean that it is having to leave the cache for everything.

The symptoms and differences you describe are just massive though and I don't see something as simple as cache coherency making that much difference.

The best advice I can five you is to try running a profiler against it both when it's running slow and when it's running fast and compare the difference.

You need solid numbers and timings. "In this environment doing X took Y time". From that you can start narrowing things down.