Experiencing occasional long garbage collection de

2019-03-12 22:04发布

站内文章 / Java

36 0

男人必须洒脱

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm having a hard time dealing with a Java garbage collection problem, and interpreting the logs.

My application requires that no GC takes longer than 2 seconds, and ideally less than 100ms.

Based on some previous advice I'm trying the following command line options:

 java -XX:MaxGCPauseMillis=100 -XX:NewRatio=9 -XX:+UseConcMarkSweepGC  -XX:+CMSIncrementalMode -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -server -Xmx9g -Xms9g

The application has a large amount of long-term stored objects, which are kept in a ConcurrentLinkedHashMap. I'm seeing occasional long pauses, in the worst case up to 10 seconds (this is the second last like of the GC logs below)!

Here is some of the output I'm getting:

16938.968: [GC 16938.968: [ParNew: 153343K->17022K(153344K), 7.8608580 secs] 6184328K->6122510K(9420160K) icms_dc=7 , 7.8614100 secs] [Times: user=0.63 sys=0.01, real=7.86 secs] 
16947.087: [GC 16947.087: [ParNew: 153342K->17022K(153344K), 7.2604030 secs] 6258830K->6198642K(9420160K) icms_dc=7 , 7.2609780 secs] [Times: user=0.44 sys=0.00, real=7.27 secs] 
16954.614: [GC 16954.614: [ParNew: 153342K->17024K(153344K), 8.4307620 secs] 6334962K->6274625K(9420160K) icms_dc=7 , 8.4313150 secs] [Times: user=0.62 sys=0.01, real=8.43 secs] 
16963.310: [GC 16963.310: [ParNew: 153344K->17023K(153344K), 6.2588760 secs] 6410945K->6350748K(9420160K) icms_dc=7 , 6.2594290 secs] [Times: user=0.48 sys=0.01, real=6.25 secs] 
16969.834: [GC 16969.834: [ParNew: 153343K->17022K(153344K), 6.0274280 secs] 6487068K->6425868K(9420160K) icms_dc=7 , 6.0279830 secs] [Times: user=0.50 sys=0.01, real=6.03 secs] 
16976.122: [GC 16976.123: [ParNew: 153342K->17022K(153344K), 11.7774620 secs] 6562188K->6503030K(9420160K) icms_dc=7 , 11.7780180 secs] [Times: user=0.43 sys=0.04, real=11.78 secs] 
16988.164: [GC 16988.164: [ParNew: 153342K->17024K(153344K), 10.9477920 secs] 6639350K->6579928K(9420160K) icms_dc=7 , 10.9483440 secs] [Times: user=0.37 sys=0.02, real=10.95 secs] 
16999.371: [GC 16999.372: [ParNew: 153344K->17023K(153344K), 9.8828360 secs] 6716248K->6655886K(9420160K) icms_dc=7 , 9.8833940 secs] [Times: user=0.42 sys=0.01, real=9.88 secs] 
17009.509: [GC 17009.509: [ParNew: 153343K->17023K(153344K), 5.0699960 secs] 6792206K->6727987K(9420160K) icms_dc=7 , 5.0705660 secs] [Times: user=0.55 sys=0.01, real=5.07 secs] 
17014.838: [GC 17014.838: [ParNew: 153343K->17023K(153344K), 6.6411750 secs] 6864307K->6790974K(9420160K) icms_dc=7 , 6.6417400 secs] [Times: user=0.37 sys=0.01, real=6.63 secs] 
17021.735: [GC 17021.735: [ParNew: 153343K->17024K(153344K), 8.0545970 secs] 6927294K->6856409K(9420160K) icms_dc=7 , 8.0551790 secs] [Times: user=0.34 sys=0.03, real=8.05 secs] 
17030.052: [GC 17030.053: [ParNew: 153344K->17023K(153344K), 7.9756730 secs] 6992729K->6922569K(9420160K) icms_dc=7 , 7.9762530 secs] [Times: user=0.34 sys=0.01, real=7.98 secs] 
17038.398: [GC 17038.398: [ParNew: 153343K->17022K(153344K), 12.9613300 secs] 7058889K->6990725K(9420160K) icms_dc=7 , 12.9618850 secs] [Times: user=0.39 sys=0.01, real=12.96 secs] 
17051.630: [GC 17051.630: [ParNew: 153342K->17022K(153344K), 6.8942910 secs] 7127045K->7059607K(9420160K) icms_dc=7 , 6.8948380 secs] [Times: user=0.56 sys=0.02, real=6.89 secs] 
17058.798: [GC 17058.798: [ParNew: 153342K->17024K(153344K), 10.0262190 secs] 7195927K->7126351K(9420160K) icms_dc=7 , 10.0267860 secs] [Times: user=0.37 sys=0.01, real=10.02 secs] 
17069.096: [GC 17069.096: [ParNew: 153344K->17023K(153344K), 10.0419500 secs] 7262671K->7195002K(9420160K) icms_dc=7 , 10.0425020 secs] [Times: user=0.40 sys=0.02, real=10.04 secs] 
17079.410: [GC 17079.410: [ParNew: 153343K->17022K(153344K), 13.5389040 secs] 7331322K->7264275K(9420160K) icms_dc=7 , 13.5394610 secs] [Times: user=0.30 sys=0.01, real=13.54 secs] 
17093.223: [GC 17093.224: [ParNew: 153342K->17023K(153344K), 10.5909450 secs] 7400595K->7330446K(9420160K) icms_dc=7 , 10.5915060 secs] [Times: user=0.33 sys=0.00, real=10.58 secs] 
17104.083: [GC 17104.084: [ParNew: 153343K->17024K(153344K), 5.8420210 secs] 7466766K->7392173K(9420160K) icms_dc=7 , 5.8425920 secs] [Times: user=0.57 sys=0.00, real=5.84 secs]

I've spent hours pouring over the various webpages that describe Java GC tuning, but none have really given me the ability to interpret the logs above and come up with a course of action. Any specific advice based on the logs I've provided would be greatly appreciated.

Update: Per a question below:

The machine has 16G of RAM, here is the info from top: Mem: 15483904k total, 15280084k used, 203820k free, 155684k buffers Swap: 2031608k total, 1347240k used, 684368k free, 3304044k cached

Its a different run, but here is the current top output for the process:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  1016 sanity   17   0 10.2g 6.5g 9464 S    1 44.2  10:24.32 java

Update 2: Some more detailed logging, this looked like it caused a 400ms delay:

{Heap before GC invocations=1331 (full 1):
 par new generation   total 153344K, used 153343K [0x00002aaaae200000, 0x00002aaab8860000, 0x00002aaab8860000)
  eden space 136320K, 100% used [0x00002aaaae200000, 0x00002aaab6720000, 0x00002aaab6720000)
  from space 17024K,  99% used [0x00002aaab77c0000, 0x00002aaab885fff0, 0x00002aaab8860000)
  to   space 17024K,   0% used [0x00002aaab6720000, 0x00002aaab6720000, 0x00002aaab77c0000)
 concurrent mark-sweep generation total 7169664K, used 4258496K [0x00002aaab8860000, 0x00002aac6e200000, 0x00002aac6e200000)
 concurrent-mark-sweep perm gen total 21248K, used 13269K [0x00002aac6e200000, 0x00002aac6f6c0000, 0x00002aac73600000)
484.738: [GC 484.738: [ParNew: 153343K->17022K(153344K), 0.3950480 secs] 4411840K->4341689K(7323008K), 0.3954820 secs] [Times: user=0.49 sys=0.07, real=0.40 secs] 
Heap after GC invocations=1332 (full 1):
 par new generation   total 153344K, used 17022K [0x00002aaaae200000, 0x00002aaab8860000, 0x00002aaab8860000)
  eden space 136320K,   0% used [0x00002aaaae200000, 0x00002aaaae200000, 0x00002aaab6720000)
  from space 17024K,  99% used [0x00002aaab6720000, 0x00002aaab77bfb68, 0x00002aaab77c0000)
  to   space 17024K,   0% used [0x00002aaab77c0000, 0x00002aaab77c0000, 0x00002aaab8860000)
 concurrent mark-sweep generation total 7169664K, used 4324666K [0x00002aaab8860000, 0x00002aac6e200000, 0x00002aac6e200000)
 concurrent-mark-sweep perm gen total 21248K, used 13269K [0x00002aac6e200000, 0x00002aac6f6c0000, 0x00002aac73600000)
}

回答1:

Turns out the problem was that the heap was getting swapped out to disk, and the delay was Java's GC having to wait for it to be swapped back in.

Solved (mostly) by setting Linux' "swappiness" parameter to 0.

回答2:

From the times, it seems the GC actually doesn't run all the time (see the user times), so most of the time, it's waiting.

Just a wild guess: Doesn't it swap? How much memory does the machine have? How much of it does the java process get (resident set size)?

Edit: why is it waiting:

Look at this (from your transcript)

[Times: user=0.39 sys=0.01, real=12.96 secs]

This means that (I guess from the beginning to the end of GC), 12 (almost 13) seconds passed. Of these 13 seconds, .39 was spent running in user mode, .01 was spent running in kernel mode. If the time collection method isn't totally flawed (ie. the numbers really represent the running time of the GC process/thread), this means at least 12 seconds waiting.

回答3:

In general, it's tough to get GC tuning right when you require such a large heap.

That being said, most of GC hang time is caused by the memory copying of objects that survive across young garbage collections.

Is your ConcurrentLinkedHashMap initialized with all the memory-persistent objects all at once? Or does it slowly grow bigger as the application keeps running? If it's the latter, it may be difficult to cut down your GC hang times since there are objects that always survive. If it's the former, you will need to size your tenured generation with the size of your persistent objects + 20% or so, and make sure that your young gen is large enough to keep up with the transient objects that are created during the course of your application.

回答4:

Perhaps the performance limit of 200ms is too strict and you need to manage garbage collection on your own? have you tried this with a larger limit?

回答5:

I think you are having this UseConcMarkSweepGC and NewRatio bug. Since your new-space is no way near being one tenth of the -Jmx=9G. The bug includes a workaround (NewSize in absolute size).

Another flag that might be very important for you is CMSInitiatingOccupancyFraction. It is set at 92% in java6 and was 68% in java5. If the old-space grows larger the CMS threadpool will start to do their work. If you have CPU to spend, it is no danger to have a live-set that is above the initiating fraction.

It would be nice if you had included the GC stats after you fixed the memory paging problem.

回答6:

If you have strict timing requirements, perhaps you should check into the Java Real-Time System.

RTSJ /Java RTS provides:

An API set, semantic Java VM enhancements, and JVM-to-OS layer modifications which allow Java developers to correctly reason about and control the temporal behavior of Java applications.

回答7:

Have you run your app through a profiler to see that what you think is happening with respect to memeory is what is actually happening?

One thing I do when I am looking into this is use the Netbeans profiler (though any profiler should give this to you) is look at the live bytes (the memory allocation) and see if the ones that have a large amount of allocated bytes and allocated objectd are what I am expecting and are coming from where I expect.

You can also probably use the profiler to look at the effect your tuning is having, run the app without any tuing args, and then add your tuning args and run it again and see what is going on with the memory.

回答8:

It can be very hard to tell without actually seeing, and in some cases profiling the code.

Have you implemented finalize() for any of your objects? That will cause a big GC penalty. It would also be interesting to see a test run with a heap of maybe 6 Gigs, if you get a disproportionate improvement in performance it would indicate that the GC is thrashing for memory.

回答9:

Can you post/link to the code of the ConcurrentLinkedHashMap implementation? If this is the implementation I've posted, please open a ticket on the project page so we can debug it together. If not, knowing the details of your implementation would help to determine where the issue might lie.

回答10:

9 GB JVM ! Never seen that before ! I think your 10 sec. pauses are quite normal. see this (maybe you've read it already...)

回答11:

for me problem was full survivor spaces. As CMS isn't compactional, garbage was promoted directly to old gen. this could be very expensive with such large heap. It's necessary to increase Survivor spaces and MaxTenuringThreshold to avoid promotion as much as possible.

regards, Alex

回答12:

I think your attention might be a little misdirected.

Spend a little time in a profiler finding your biggest allocation hotspots. If there are just a few places in your code where the majority of your allocations occur, try using an object pool rather than always constructing new objects.

Collection classes and StringBuilders are great candidates for pooling. When you return them to the pool, call the collection.clear() or the stringbuilder.setLength(0) methods, so that they're ready for consumption when the next caller wants to retrieve them from the pool.

The best way to tune the GC is by creating fewer objects. There are lots of strategies for eliminating allocations, and pooling is just one of them (though one of my favorites).

UPDATE: It's been five years since I wrote this answer, and my opinion on pooling has mostly changed. Back when I wrote this answer in 2009, I could frequently use object pooling (even of simple objects like StringBuilder) to speed up tight inner loops with lots of allocations. These days, it's harder to find cases where pooling doesn't make the situation worse. I almost never use pools for anything other than threads or connections. Still though, it's a nice tool to have at your disposal, even if you don't use it often.