Why am I experiencing long non-GC related applicat

2019-04-09 20:08发布

I have been seeing extremely long application pauses in the GC logs that are seemingly not related to garbage collection. The application is on a 16GB heap and allocated memory throughput is quite large. I understand that the GC logs will show pauses that aren't necessarily related to GC and that seems to be the case here.

Also, during these pauses the CPU hits 100%, there is no disk or memory I/O and there is no logging activity (due to all the application threads being stopped).

Any ideas on what may be causing these long pauses, or any suggestions on how I can track the root cause down?

Here is the relevant snippet from the logs (notice the 151 second pause):

2015-04-21T06:30:04.834-0500: 217179.483: [GC (Allocation Failure)2015-04-21T06:30:04.849-0500: 217179.483: [ParNew
Desired survivor size 43614208 bytes, new threshold 6 (max 6)
- age   1:   19876896 bytes,   19876896 total
- age   2:   15903904 bytes,   35780800 total
- age   3:     617120 bytes,   36397920 total
- age   4:      34096 bytes,   36432016 total
- age   5:     399896 bytes,   36831912 total
- age   6:      71112 bytes,   36903024 total
: 706301K->45145K(766784K), 0.0681632 secs] 19029029K->18368371K(25080640K), 0.0684183 secs] [Times: user=0.44 sys=0.00, real=0.08 secs] 
2015-04-21T06:30:05.816-0500: 217180.453: Total time for which application threads were stopped: 1.8507910 seconds
2015-04-21T06:30:09.171-0500: 217183.812: [GC (Allocation Failure)2015-04-21T06:30:09.171-0500: 217183.812: [ParNew
Desired survivor size 43614208 bytes, new threshold 6 (max 6)
- age   1:   14812976 bytes,   14812976 total
- age   2:    3411968 bytes,   18224944 total
- age   3:   14952032 bytes,   33176976 total
- age   4:     430048 bytes,   33607024 total
- age   5:      32864 bytes,   33639888 total
- age   6:     374880 bytes,   34014768 total
: 726745K->43352K(766784K), 0.0673428 secs] 19049971K->18366645K(25080640K), 0.0675792 secs] [Times: user=0.48 sys=0.00, real=0.06 secs] 
2015-04-21T06:30:09.249-0500: 217183.886: Total time for which application threads were stopped: 0.0763406 seconds
2015-04-21T06:32:43.960-0500: 217338.603: Total time for which application threads were stopped: 151.1866188 seconds
2015-04-21T06:32:43.975-0500: 217338.607: [GC (GCLocker Initiated GC)2015-04-21T06:32:43.975-0500: 217338.607: [ParNew
Desired survivor size 43614208 bytes, new threshold 4 (max 6)
- age   1:   18173880 bytes,   18173880 total
- age   2:    9659800 bytes,   27833680 total
- age   3:    3400176 bytes,   31233856 total
- age   4:   14928944 bytes,   46162800 total
- age   5:     375056 bytes,   46537856 total
- age   6:      31880 bytes,   46569736 total
: 727682K->56678K(766784K), 0.0759093 secs] 19050975K->18380341K(25080640K), 0.0760795 secs] [Times: user=0.58 sys=0.00, real=0.08 secs] 
2015-04-21T06:32:44.053-0500: 217338.684: Total time for which application threads were stopped: 0.0769955 seconds

Here are my JVM settings at the moment:

-Xmx16g
-XX:MaxPermSize=256m
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=true
-Dcom.sun.management.jmxremote.login.config=JmxFileLoginModule
-Dcom.sun.management.jmxremote.port=8082
-Dcom.sun.management.jmxremote.ssl=false
-Dfile.encoding=UTF8
-Djava.awt.headless=true
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Xloggc:{fileLoc}
-XX:+DisableExplicitGC
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCCause
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:GCLogFileSize=50M
-XX:NumberOfGCLogFiles=10
-XX:ReservedCodeCacheSize=256M
-XX:-UseGCLogFileRotation
-XX:+PrintSafepointStatistics
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=4000
-XX:+PrintSafepointStatistics

UPDATE:

I forgot to include my safepoint logging here. This is what my safepoint log looks like for the 151 second pause experienced above:

217338.603: [sweeping nmethods, 0.0000036 secs]
             vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
    217187.422: GenCollectForAllocation          [     140         46              1    ]      [     1     0151184     2     0    ]  0

The block time is 0, but it appears right next to the sync time, so the sync time is actually 151184.

For other GC-related pauses the safepoint log appears like this:

107100.562: [sweeping nmethods, 0.0000015 secs]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
107100.563: GenCollectForAllocation          [     117          0              0    ]      [     0     0     0     0 38238    ]  0

So, the GC pause shows as vmop time and the non-GC pauses show as sync time. I'll have to read more about what this means.

2条回答
时光不老,我们不散
2楼-- · 2019-04-09 20:22

We've experienced a similar problem that appeared to be a JVM bug JDK-8041984.
I'm not sure this is your case as well, but consider updating to JDK 8u45 or 7u80.

Whenever you see this problem again try to take a forced stack dump with the native frames during this long pause:

jstack -F -m <PID>

This will help to identify what the VM is doing even if the process is not responding.

查看更多
乱世女痞
3楼-- · 2019-04-09 20:31

The block time is 0, but it appears right next to the sync time, so the sync time is actually 151184.

You're reading the output wrong, due to tabs the headers are not properly aligned, you need to count columns.

As I understand it it's basically waiting 151 seconds to reach a safepoint because a mutator-thread remains a live despite a safepoint trap being active.

This can happen either when it spends too much time in native code or as @apangin mentions when compiler threads do excessive amounts of work.

I think there were also some cases where the compiler would optimize away a safepoint in a loop backedge when it could prove that the loop finishes in a finite amount of time, which could take several seconds in loops iterating over long counters.

查看更多
登录 后发表回答