Java 1.8 safepoint timeout

2020-07-18 09:23发布

问题:

I seem to be running into a scenario where the JVM is stuck indefinitely in trying to get to a safe point after a few hours. However, if I do a jstack with -F option it seems to get out of that wait and continues with the execution.

jdk1.8.0_45/bin/jstack -F 39924 >a.out

I am using jdk1.8.0_45 on Centos

My questions are:

i) It seems that the JVM can come out of that safepoint indefinite wait when sent an interrupt from jstack. How come it doesnt come out without jstack. Is there some jvm option I can use to avoid that indefinite wait.

ii) Can I get a more definite thread dump of the the thread that's causing the issue. The output from the safepoint log seems imprecise.

The options I am using are:.

-server
-XX:+AggressiveOpts
-XX:+UseG1GC
-XX:+UnlockExperimentalVMOptions
-XX:G1MixedGCLiveThresholdPercent=85
-XX:InitiatingHeapOccupancyPercent=30
-XX:G1HeapWastePercent=5 
-XX:MaxGCPauseMillis=1000
-XX:G1HeapRegionSize=4M
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:+UnlockExperimentalVMOptions
-XX:G1LogLevel=finest
-Xmx6000m
-Xdebug
-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=999
-XX:+SafepointTimeout
-XX:+UnlockDiagnosticVMOptions
-XX:SafepointTimeoutDelay=20000
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1 

The safepoint log

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17771.115: G1IncCollectionPause             [     170          0              0    ]      [     0     0     0     0     8    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17771.125: RevokeBias                       [     170          1              2    ]      [     0     0     0     0     0    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17771.127: RevokeBias                       [     170          1              1    ]      [     0     0     0     0     0    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17771.131: RevokeBias                       [     170          1              2    ]      [     0     0     0     0     0    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17771.955: RevokeBias                       [     169          0              2    ]      [     0     0     0     0     0    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17772.160: BulkRevokeBias                   [     171          0              2    ]      [     0     0     0     0     0    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17772.352: RevokeBias                       [     170          1              3    ]      [     0     0     0     0     0    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17773.596: RevokeBias                       [     169          0              1    ]      [     0     0     0     0     0    ]  0

 # SafepointSynchronize::begin: Timeout detected:
 # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
 # SafepointSynchronize::begin: Threads which did not reach the safepoint:
 # "Thread-14" #115 prio=5 os_prio=0 tid=0x00007f20c8029000 nid=0x9cd0 runnable [0x0000000000000000]    java.lang.Thread.State: RUNNABLE
 # SafepointSynchronize::begin: (End of list)

After the jstack interrupt this is what I see from the safepoint log

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
17779.826: G1IncCollectionPause             [     169          1              1    ]      [3315603     03315603     0     8    ]  1

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
21095.439: RevokeBias                       [     169          2             13    ]      [     0     0     0     0     0    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
21095.439: RevokeBias                       [     169          1              2    ]      [     0     0     0     0     0    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
21095.441: RevokeBias                       [     184          3              4    ]      [     0     0     3     0     1    ]  0

vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
21095.447: RevokeBias                       [     190          0              2    ]      [     0     0     4     0     2    ]  0

回答1:

Since you are able to remedy the problem by interrupting the VM and you're on CentOS the problem reminds me of this kernel bug.

The thread lists the following affected versions (assuming standard kernels):

  • RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
  • RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday. there does not yet appear to be a 7.x fix.
  • RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).